THE CARTOON GUIDE TO STATISTICS


LARRY GONICK


Author of The Cartoon History of the Universe

& WOOLLCOTT SMITH


Also  by  Larry  Gonick 


The  Cartoon  History  of  the  Universe 
The  Cartoon  Guide  to  Physics  (with  Art  Huffman) 
The  Cartoon  Guide  to  the  Computer 
The  Cartoon  Guide  to  Genetics  (with  Mark  Wheelis) 
The  Cartoon  History  of  the  United  States 
The  Cartoon  Guide  to  (Non)  Communication 


THE CARTOON GUIDE TO STATISTICS


LARRY GONICK
& WOOLLCOTT SMITH


HarperPerennial

A Division of HarperCollinsPublishers


THE CARTOON GUIDE TO STATISTICS. Copyright © 1993 by Larry Gonick and Woollcott Smith.
All rights reserved. Printed in the United States of America. No part of this book may be
used or reproduced in any manner whatsoever without written permission except in the
case of brief quotations embodied in critical articles and reviews. For information address
HarperCollins Publishers, Inc., 10 East 53rd Street, New York, NY 10022.

HarperCollins books may be purchased for educational, business, or sales promotional
use. For information please write: Special Markets Department, HarperCollins Publishers,
Inc., 10 East 53rd Street, New York, NY 10022.

FIRST HARPERPERENNIAL EDITION

Illustrations by Larry Gonick


Library of Congress Cataloging-in-Publication Data
Gonick, Larry.

The cartoon guide to statistics / Larry Gonick & Woollcott Smith.
--1st HarperPerennial ed.

Includes bibliographical references and index.
ISBN 0-06-273102-5 (pbk.)

1. Statistics--Caricatures and cartoons. I. Smith, Woollcott, 1941- . II. Title.
QA276.12.G67 1993

519.5--dc20 92-54683


97  LG/RRD  20  19  18  17  16 


♦ CONTENTS ♦


CHAPTER 1: WHAT IS STATISTICS?

CHAPTER 2: DATA DESCRIPTION

CHAPTER 3: PROBABILITY

CHAPTER 4: RANDOM VARIABLES

CHAPTER 5: A TALE OF TWO DISTRIBUTIONS

CHAPTER 6: SAMPLING

CHAPTER 7: CONFIDENCE INTERVALS

CHAPTER 8: HYPOTHESIS TESTING

CHAPTER 9: COMPARING TWO POPULATIONS

CHAPTER 10: EXPERIMENTAL DESIGN

CHAPTER 11: REGRESSION

CHAPTER 12: CONCLUSION

BIBLIOGRAPHY

INDEX


Acknowledgments 

WE WOULD LIKE TO THANK CAROL COHEN AT HARPERCOLLINS
FOR SUGGESTING THIS PROJECT, OUR EDITOR ERICA SPABERG
FOR PATIENTLY ENDURING THE LAST-MINUTE DASH TO THE
DEADLINE, AND VICKY BIJUR, OUR LITERARY AGENT, FOR
INITIATING THE GONICK/SMITH COLLABORATION BY INTRODUCING
THE COAUTHORS.

WILLIAM FAIRLEY'S AND LEAH SMITH'S COMMENTS IMPROVED
EARLIER DRAFTS OF THIS BOOK.

DONNA OKINO PROVIDED INVALUABLE ASSISTANCE AND ADVICE
IN PRODUCING THE CARTOON PAGES. SHE SAYS THAT CREATING
A CARTOON GUIDE IS HARDER THAN RUNNING A MARATHON,
AND SHE SHOULD KNOW: SHE'S DONE BOTH.

THE ALTSYS CORPORATION CREATED FONTOGRAPHER, THE
WONDERFUL SOFTWARE THAT ALLOWED US TO SIMULATE
HAND-LETTERED TEXT AND FORMULAS ON THE MACINTOSH.

AND, SINCE EDUCATION IS ALWAYS A TWO-WAY STREET, A TIP
OF THE HAT TO SMITH'S LONG-SUFFERING TEMPLE UNIVERSITY
STUDENTS AND ESPECIALLY THE FALL '92 STUDY GROUP
ORGANIZED BY ADRIANA TORRES. THE FUTURE IS THEIRS.


♦ Chapter 1 ♦

WHAT  IS  STATISTICS? 


WE MUDDLE THROUGH LIFE MAKING CHOICES
BASED ON INCOMPLETE INFORMATION...


MOST OF US LIVE
COMFORTABLY WITH SOME
LEVEL OF UNCERTAINTY.


ffijl Y  '’LL  HAVE  'MtAArrh  Coulp  Vdu  ^ 
A  JWT  •PWfr  MB  A 
CALCULATOR  ?  . 


WHAT MAKES STATISTICS UNIQUE IS ITS ABILITY TO QUANTIFY UNCERTAINTY,
TO MAKE IT PRECISE. THIS ALLOWS STATISTICIANS TO MAKE CATEGORICAL
STATEMENTS, WITH COMPLETE ASSURANCE, ABOUT THEIR LEVEL OF
UNCERTAINTY!


GOOD CHOICE! I'M 95%
CONFIDENT THAT TONIGHT'S
SOUP HAS PROBABILITY
BETWEEN 73% AND 77% OF
BEING REALLY DELICIOUS!


TO ACCOMPLISH THEIR FEATS OF MATHEMATICAL
LEGERDEMAIN, STATISTICIANS RELY ON THREE
RELATED DISCIPLINES:

Data analysis
THE GATHERING, DISPLAY, AND SUMMARY OF DATA.

Probability
THE LAWS OF CHANCE, IN AND OUT OF THE CASINO.

Statistical inference
DRAWING STATISTICAL CONCLUSIONS FROM SPECIFIC DATA,
USING A KNOWLEDGE OF PROBABILITY.

IN THIS BOOK, WE'LL LOOK AT ALL THREE, AS APPLIED TO A WIDE VARIETY OF
SITUATIONS WHERE STATISTICS PLAYS A CRUCIAL ROLE IN THE MODERN WORLD.


IN CHAPTER 7 AND
BEYOND, WE DESCRIBE
HOW TO MAKE
STATISTICAL INFERENCES
IN SUCH COMMON
REAL-WORLD ARENAS AS
ELECTION POLLING,
MANUFACTURING QUALITY
CONTROL, MEDICAL
TESTING,
ENVIRONMENTAL
MONITORING, RACIAL
BIAS, AND THE LAW.


IN SHORT,


OUR HUMBLE OPINION IS THAT LEARNING A LITTLE MORE ABOUT THE
SUBJECT MIGHT NOT BE SUCH A BAD IDEA, AND THAT'S WHY WE WROTE THIS
BOOK!


♦ Chapter 2 ♦

DATA  DESCRIPTION 


STUDENT GOES OVER EACH STUDENT'S REPORTED WEIGHT.

[Figure: dot plot of the students' reported weights; horizontal axis: Weight in Pounds, 100 to 200]


YOU MAY SEE A PROBLEM HERE:
THE CLUMPS AT 150 AND 155
POUNDS. THE STUDENTS TENDED
TO REPORT THEIR WEIGHT IN
FIVE-POUND INCREMENTS. IN
REAL-LIFE SITUATIONS LIKE THIS
ONE, SUCH ROUNDING OFF CAN
OBSCURE GENERAL PATTERNS IN
DATA, BUT FOR NOW, WE'LL JUST
WORK AROUND IT.


WE CAN SUMMARIZE THE DATA WITH A FREQUENCY TABLE. DIVIDE THE NUMBER
LINE INTO INTERVALS AND COUNT THE NUMBER OF STUDENT WEIGHTS WITHIN
EACH INTERVAL. THE FREQUENCY IS THE COUNT IN ANY GIVEN INTERVAL. THE
RELATIVE FREQUENCY IS THE PROPORTION OF WEIGHTS IN EACH INTERVAL,
I.E., IT'S THE FREQUENCY DIVIDED BY THE TOTAL NUMBER OF STUDENTS.

CLASS INTERVAL    MIDPOINT    FREQUENCY    RELATIVE FREQUENCY
 87.5-102.4          95           2             .022
102.5-117.4         110           9             .098
117.5-132.4         125          19             .206
132.5-147.4         140          17             .185
147.5-162.4         155          27             .293
162.5-177.4         170           8             .087
177.5-192.4         185           8             .087
192.5-207.4         200           1             .011
207.5-222.4         215           1             .011
TOTAL                            92            1.000

NOTE: WE KEPT THE INTERVAL BOUNDARIES AWAY FROM THOSE TROUBLESOME
5-POUND MULTIPLES. THIS GETS AROUND THE STUDENTS' REPORTING BIAS.
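
The tallying behind the frequency table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 92 weights are transcribed from the chapter's stem-and-leaf diagram (some digits are a best reading of the scanned page), and the names `STEMS`, `weights`, `midpoints`, and `counts` are hypothetical.

```python
from collections import Counter

# Leaves per stem, transcribed from the stem-and-leaf diagram (assumed reading).
STEMS = {
    9: "5", 10: "288", 11: "002556688", 12: "00012355555",
    13: "0000013555688", 14: "00002555558",
    15: "0" * 10 + "3" + "5" * 10 + "7",
    16: "000045", 17: "000055", 18: "0005", 19: "00005", 21: "5",
}
weights = sorted(10 * stem + int(leaf)
                 for stem, leaves in STEMS.items() for leaf in leaves)

# Equal-length intervals (15 pounds wide) centered on round midpoints.
midpoints = list(range(95, 216, 15))  # 95, 110, ..., 215

# Each whole-number weight falls in the interval whose midpoint is nearest;
# boundaries such as 102.45 are never hit, so there are no ties.
counts = Counter(min(midpoints, key=lambda m: abs(w - m)) for w in weights)

n = len(weights)
for mid in midpoints:
    freq = counts[mid]
    print(f"{mid - 7.5:5.1f}-{mid + 7.4:5.1f}  {mid:3d}  {freq:2d}  {freq / n:.3f}")
```

Because the boundaries sit between the five-pound multiples, a weight reported as 150 or 155 can never land exactly on an edge.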

GUIDELINES FOR FORMING THE CLASS INTERVALS


1) USE INTERVALS OF EQUAL LENGTH WITH MIDPOINTS AT CONVENIENT ROUND NUMBERS.

2) FOR A SMALL DATA SET, USE A SMALL NUMBER OF INTERVALS.

3) FOR A LARGE DATA SET, USE MORE INTERVALS!


IN THE FREQUENCY TABLE, WE ARE SHOWING HOW MANY DATA POINTS ARE
"AROUND" EACH VALUE. WE CAN GRAPH THIS INFORMATION, TOO. THE RESULTING
BAR GRAPH IS CALLED A HISTOGRAM. EACH BAR COVERS AN INTERVAL AND IS
CENTERED AT THE MIDPOINT. THE BAR'S HEIGHT IS THE NUMBER OF DATA
POINTS IN THE INTERVAL.

[Figure: frequency histogram of the weights; horizontal axis: Weight in Pounds, 100 to 200]


WE CAN ALSO DRAW A RELATIVE FREQUENCY HISTOGRAM, PLOTTING THE
RELATIVE FREQUENCY AGAINST THE WEIGHT. IT LOOKS EXACTLY THE SAME,
EXCEPT FOR THE VERTICAL SCALE.

[Figure: relative frequency histogram of the weights; horizontal axis: Weight in Pounds]


GOOD GRAPHIC DISPLAY IS PART
ART AND PART SCIENCE...


AND SOMETIMES, PART
POLITICS!


CRUSADING NURSE FLORENCE NIGHTINGALE
COMPILED MORTALITY STATISTICS FROM
BRITISH MILITARY HOSPITALS, PRODUCING
SHOCKING HISTOGRAMS LIKE THIS ONE.
THE RADIAL AXIS
INDICATES DEATHS, IN
HOSPITALS AS WELL AS
ON THE BATTLEFIELD,
OF BRITISH SOLDIERS
IN THE CRIMEAN WAR.

[Figure: Nightingale's radial mortality diagram]

HER STATISTICAL EFFORTS LED
DIRECTLY TO IMPROVED HOSPITAL
CONDITIONS AND A REDUCTION IN THE
DEATH RATE.

"SAVED BY STATISTICS!"


ANY SET OF MEASUREMENTS
HAS TWO IMPORTANT
PROPERTIES: THE CENTRAL
OR TYPICAL VALUE, AND
THE SPREAD ABOUT THAT
VALUE. YOU CAN SEE THE
IDEA IN THESE
HYPOTHETICAL HISTOGRAMS.


A SMALL SET OF n = 5 DATA POINTS MAKES THE BOOKKEEPING EASY.
SUPPOSE, FOR EXAMPLE, WE ASK FIVE PEOPLE HOW MANY HOURS OF
TELEVISION THEY WATCH IN A WEEK, AND GET THE FOLLOWING ARRAY:


OBSERVATION   1   2   3   4   5

DATA VALUE    5   7   3  38   7


THEN $x_1 = 5$, $x_2 = 7$, $x_3 = 3$, $x_4 = 38$, AND $x_5 = 7$.

WHAT'S THE "CENTER" OF
THESE DATA? THERE ARE
ACTUALLY SEVERAL
DIFFERENT WAYS TO
MEASURE IT. WE'LL LOOK AT
JUST TWO OF THEM.


TO REPEAT: THE AVERAGE, OR MEAN, OF A SET OF DATA $x_i$ IS

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

IN THE CASE OF OUR 92 PENN STATE STUDENTS, THE MEAN WEIGHT IS

$$\bar{x} = \frac{1}{92}\sum_{i=1}^{92} x_i = 145.15 \text{ POUNDS}$$


FOR THE n = 92 STUDENT WEIGHTS,
WE CAN FIND THE MEDIAN FROM THE
ORDERED STEM-AND-LEAF DIAGRAM.
JUST COUNT TO THE 46TH
OBSERVATION. THE MEDIAN IS

$$\frac{x_{46} + x_{47}}{2} = \frac{145 + 145}{2} = 145 \text{ POUNDS}$$


 9 : 5
10 : 288
11 : 002556688
12 : 00012355555
13 : 0000013555688
14 : 00002555558
15 : 0000000000355555555557
16 : 000045
17 : 000055
18 : 0005
19 : 00005
20 :
21 : 5


IN 1984 THE UNIVERSITY OF VIRGINIA ANNOUNCED
THAT ITS DEPARTMENT OF RHETORIC AND
COMMUNICATIONS GRADUATES' MEAN STARTING SALARY
WAS $55,000. THE OUTLIER, THE SALARY OF NBA
CENTER RALPH SAMPSON, DID NOT REPRESENT THE
EARNING POWER OF A BA IN SPEECH FROM U. OF V.
(THE MEDIAN SALARY WASN'T PUBLISHED.)


WHY MORE THAN ONE MEASURE OF THE CENTER? EACH HAS ADVANTAGES. FOR
EXAMPLE, THE MEDIAN IS NOT SENSITIVE TO OUTLIERS, OR EXTREME VALUES
NOT TYPICAL OF THE REST OF THE DATA. SUPPOSE IN OUR SMALL
TV-WATCHING GROUP, ONE PERSON WATCHES 200 HOURS PER WEEK. THEN OUR
DATA ARE 3, 5, 7, 7, 200. THE MEDIAN, 7, IS UNCHANGED, BUT THE MEAN IS
NOW $\bar{x} = 44.4$!
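
To see the median's insensitivity to outliers in action, here is a small Python sketch using the five TV-watching values from the text. The second extreme value (2000) is a made-up illustration, not data from the book, and the variable names are hypothetical.

```python
from statistics import mean, median

# Four ordinary viewers plus one extreme value; 200 comes from the text,
# 2000 is a hypothetical, even wilder outlier added for comparison.
for extreme in (200, 2000):
    tv_hours = [3, 5, 7, 7, extreme]
    print(f"outlier={extreme}: mean={mean(tv_hours)}, median={median(tv_hours)}")
```

The mean is dragged along by the extreme value, while the median stays at the middle observation no matter how large the outlier grows.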


MEASURES OF SPREAD


BESIDES KNOWING THE
CENTRAL POINT OF A DATA
SET, WE'D ALSO LIKE TO
DESCRIBE THE DATA'S
SPREAD, OR HOW FAR
FROM THE CENTER THE
DATA TEND TO RANGE.
FOR INSTANCE, IF THE
STUDENTS ALL WEIGHED
EXACTLY 145 POUNDS,
THERE WOULD BE NO
SPREAD AT ALL.
NUMERICALLY THE SPREAD
WOULD BE ZERO, AND THE
HISTOGRAM WOULD BE
SKINNY.


BUT IF MANY OF THE STUDENTS WERE VERY LIGHT AND/OR VERY HEAVY,
OBVIOUSLY WE'D SEE SOME SPREAD, SAY IF THE FOOTBALL TEAM WAS PART
OF THE SAMPLE.


THE HISTOGRAM WOULD BE WIDER, SOMETHING LIKE THIS:

[Figure: a wider histogram]


AGAIN, THERE'S MORE THAN ONE WAY TO MEASURE A SPREAD. ONE WAY IS THE

INTERQUARTILE RANGE


THE IDEA IS TO DIVIDE
THE DATA INTO FOUR
EQUAL GROUPS AND SEE
HOW FAR APART THE
EXTREME GROUPS ARE.


HERE'S THE RECIPE:


1) PUT THE DATA IN NUMERICAL ORDER.

2) DIVIDE THE DATA INTO TWO EQUAL HIGH AND LOW GROUPS AT THE MEDIAN. (IF THE MEDIAN IS A DATA POINT, INCLUDE IT IN BOTH THE HIGH AND LOW GROUPS.)

3) FIND THE MEDIAN OF THE LOW GROUP. THIS IS CALLED THE FIRST QUARTILE, OR $Q_1$.

4) THE MEDIAN OF THE HIGH GROUP IS THE THIRD QUARTILE, OR $Q_3$.

[Figure: the ordered data split at the median, with the median of the highs and the median of the lows marked]


NOW THE INTERQUARTILE RANGE (IQR) IS THE DISTANCE (OR DIFFERENCE)
BETWEEN THEM:

$$IQR = Q_3 - Q_1$$


HERE'S THE WEIGHT DATA
WITH THE MIDPOINTS OF
THE HIGH AND LOW GROUPS
EMPHASIZED.

[Figure: the stem-and-leaf diagram of the 92 weights again, with the median of the low group and the median of the high group emphasized]

AND WE SEE THAT

$$IQR = 156 - 125 = 31 \text{ POUNDS}$$

YEAH, THIS IS THE DIFFERENCE
BETWEEN THE MEDIAN HEAVY
STUDENT AND MEDIAN LIGHT ONE.


JOHN TUKEY INVENTED ANOTHER KIND OF
DISPLAY TO SHOW OFF THE IQR, CALLED A
BOX AND WHISKERS PLOT. THE BOX'S
ENDS ARE THE QUARTILES $Q_1$ AND $Q_3$. WE
DRAW THE MEDIAN INSIDE THE BOX.

[Figure: box from $Q_1$ to $Q_3$ with the median drawn inside; horizontal axis in pounds]

IF A POINT IS MORE THAN 1.5 IQR FROM
AN END OF THE BOX, IT'S AN OUTLIER.
DRAW THE OUTLIERS INDIVIDUALLY.

[Figure: the box with the outlier drawn as a separate point]

FINALLY, EXTEND "WHISKERS" OUT TO THE
FARTHEST POINTS THAT ARE NOT OUTLIERS
(I.E., WITHIN 1.5 IQR OF THE QUARTILES).
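
The quartile recipe and the 1.5 IQR outlier rule can be sketched in Python as follows. This is an illustration only: the weights are transcribed from the stem-and-leaf diagram (a best reading of the scanned page), and `STEMS`, `quartiles`, and the other names are hypothetical. Note that library quartile functions may use slightly different conventions than the recipe described here.

```python
from statistics import median

# Leaves per stem, transcribed from the stem-and-leaf diagram (assumed reading).
STEMS = {
    9: "5", 10: "288", 11: "002556688", 12: "00012355555",
    13: "0000013555688", 14: "00002555558",
    15: "0" * 10 + "3" + "5" * 10 + "7",
    16: "000045", 17: "000055", 18: "0005", 19: "00005", 21: "5",
}
weights = sorted(10 * stem + int(leaf)
                 for stem, leaves in STEMS.items() for leaf in leaves)


def quartiles(data):
    """Return (Q1, median, Q3) using the split-at-the-median recipe."""
    xs = sorted(data)
    half = len(xs) // 2
    if len(xs) % 2:                      # odd n: the median is a data point,
        low, high = xs[:half + 1], xs[half:]   # so it joins both groups
    else:
        low, high = xs[:half], xs[half:]
    return median(low), median(xs), median(high)


q1, med, q3 = quartiles(weights)
iqr = q3 - q1

# Points more than 1.5 IQR beyond an end of the box count as outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [w for w in weights if w < low_fence or w > high_fence]

# Whiskers reach the farthest points that are not outliers.
whiskers = (min(w for w in weights if w >= low_fence),
            max(w for w in weights if w <= high_fence))

print(q1, med, q3, iqr, outliers, whiskers)
```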


THE VARIANCE


WHICH, FOR OUR SIMPLE DATA SET, IS

IN  f  .^5?  -M.« 


EVEN FOR SMALL DATA SETS
THE ARITHMETIC CAN BE
TEDIOUS! SO NOWADAYS, WE
JUST HIT THE S BUTTON ON
THE HAND CALCULATOR, OR
CONSULT THE DATA REPORT
GENERATED BY A COMPUTER
SOFTWARE PACKAGE.


Properties of $\bar{x}$ and $s$

THE MEAN AND STANDARD
DEVIATION ARE VERY GOOD
FOR SUMMARIZING THE
PROPERTIES OF FAIRLY
SYMMETRICAL HISTOGRAMS
WITHOUT OUTLIERS, I.E.,
HISTOGRAMS SHAPED LIKE
MOUNDS.


IT'S OFTEN USEFUL TO KNOW HOW MANY STANDARD DEVIATIONS A DATA POINT
IS FROM THE MEAN. WE DEFINE z-SCORES, OR STANDARDIZED SCORES, AS
DISTANCE FROM $\bar{x}$ PER STANDARD DEVIATION:

$$z_i = \frac{x_i - \bar{x}}{s}$$

A z-SCORE OF +2 MEANS THAT AN OBSERVATION IS TWO STANDARD
DEVIATIONS ABOVE THE MEAN. FOR THE WEIGHT DATA ($\bar{x} = 145.2$ AND
$s = 23.7$), WE CAN PLOT THE DATA ON THE ORIGINAL x-AXIS IN POUNDS AND
THE z-SCORE AXIS SIMULTANEOUSLY.
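
As a rough illustration of the two axes, the sketch below converts pounds to z-scores and back, assuming the rounded summary values read from the text (mean 145.2 pounds, standard deviation 23.7 pounds). The function names are hypothetical.

```python
MEAN_LB = 145.2   # sample mean of the weight data, as read from the text
SD_LB = 23.7      # sample standard deviation, as read from the text


def z_score(pounds, mean=MEAN_LB, sd=SD_LB):
    """Distance from the mean, measured in standard deviations."""
    return (pounds - mean) / sd


def pounds_at(z, mean=MEAN_LB, sd=SD_LB):
    """Inverse map: the weight that sits z standard deviations from the mean."""
    return mean + z * sd


for z in (-2, -1, 0, 1, 2):
    print(f"z = {z:+d}  <->  {pounds_at(z):.1f} pounds")
print(f"the 215-pound student has z = {z_score(215):.2f}")
```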




WE'VE COME A LONG WAY IN THIS CHAPTER! STARTING WITH AN UNORGANIZED
PILE OF NUMBERS, WE HAVE


NOW, IN ORDER TO PROBE THE BEHAVIOR OF DATA MORE DEEPLY, WE'RE GOING
TO MAKE A LITTLE DETOUR INTO THE REALM OF RANDOMNESS, A LAND WHERE
THINGS ALWAYS WORK OUT IN THE LONG RUN, AND WHERE THE ONLY LAW IS
THE LAW OF THE GAMBLING CASINO...


♦ Chapter 3 ♦

PROBABILITY


NOTHING IN LIFE IS CERTAIN. IN EVERYTHING WE DO, WE
GAUGE THE CHANCES OF SUCCESSFUL OUTCOMES, FROM
BUSINESS TO MEDICINE TO THE WEATHER. BUT FOR MOST


THE CHEVALIER REASONED
THAT THE AVERAGE NUMBER
OF SUCCESSFUL ROLLS WAS
THE SAME FOR BOTH GAMBLES:

CHANCE OF ONE SIX = $\frac{1}{6}$

AVERAGE NUMBER IN FOUR ROLLS = $4\left(\frac{1}{6}\right) = \frac{2}{3}$

CHANCE OF DOUBLE SIX IN ONE ROLL = $\frac{1}{36}$

AVERAGE NUMBER IN 24 ROLLS = $24\left(\frac{1}{36}\right) = \frac{2}{3}$

WHY, THEN, DID HE LOSE
MORE OFTEN WITH THE
SECOND GAMBLE?


BASIC  DEFINITIONS 

AS OUR GAMBLER PLAYS A GAME, WE PLAY
SCIENTIST, OBSERVING THE OUTCOME:

A RANDOM EXPERIMENT
IS THE PROCESS OF OBSERVING THE
OUTCOME OF A CHANCE EVENT.

THE ELEMENTARY OUTCOMES
ARE ALL POSSIBLE RESULTS OF THE RANDOM
EXPERIMENT.

THE SAMPLE SPACE
IS THE SET OR COLLECTION OF ALL THE
ELEMENTARY OUTCOMES.


THE SAMPLE SPACE OF THE THROW OF A SINGLE DIE IS A LITTLE BIGGER.


AND FOR A PAIR OF DICE, THE SAMPLE SPACE LOOKS LIKE THIS (WE MAKE ONE
DIE WHITE AND ONE BLACK TO TELL THEM APART):

[Figure: 6 x 6 grid showing all 36 outcomes for the white die and the black die]


AT SOME POINT, WE HAVE TO STOP
LISTING, AND START THINKING...


FOR EXAMPLE, IN A FAIR COIN
TOSS, HEADS AND TAILS ARE
EQUALLY LIKELY, AND WE
ASSIGN THEM BOTH THE
PROBABILITY .5.

P(H) = P(T) = .5

EACH OUTCOME COMES
UP HALF THE TIME.

ASK ANY FOOTBALL
PLAYER!


(IN THE LONG RUN, THAT'S THE PROPORTION OF TIMES IT WILL OCCUR!)


BASIC  OPERATIONS 

SO FAR, WE HAVE DISCUSSED ONLY THE
PROBABILITY OF ELEMENTARY OUTCOMES.
IN THEORY, THAT WOULD BE ENOUGH TO
DESCRIBE ANY RANDOM EXPERIMENT, BUT
IN PRACTICE IT'S PRETTY UNWIELDY. FOR
EXAMPLE, EVEN SUCH AN ORDINARY
OCCURRENCE AS ROLLING A SEVEN IS NOT
AN ELEMENTARY OUTCOME. SO WE
INTRODUCE A NEW IDEA:


AN EVENT IS A SET OF ELEMENTARY OUTCOMES. THE PROBABILITY OF AN
EVENT IS THE SUM OF THE PROBABILITIES OF THE ELEMENTARY OUTCOMES IN
THE SET. FOR INSTANCE, SOME EVENTS IN THE LIFE OF A TWO-DICED ROLLER
ARE:

EVENT  DESCRIPTION          EVENT'S ELEMENTARY OUTCOMES                  PROBABILITY
A      DICE ADD TO 3        {(1,2), (2,1)}                               2/36
B      DICE ADD TO 6        {(1,5), (2,4), (3,3), (4,2), (5,1)}          5/36
C      WHITE DIE SHOWS 1    {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6)}   6/36
D      BLACK DIE SHOWS 1    {(1,1), (2,1), (3,1), (4,1), (5,1), (6,1)}   6/36

COMBINING OUR PRIMITIVE
DEFINITIONS OF PROBABILITY WITH
THESE LOGICAL OPERATIONS WILL
GIVE US SOME POWERFUL
FORMULAS FOR MANIPULATING
PROBABILITIES.


GAMBLE  (TOMfULSiVELy 

AMD  I  LOST  MV  4MIRT 
/\NP  M  PASCAL  IS  STILL 
WORKING  ON  My  PROBLEM  t 
WHAT  APE  MV 
AVe<  TO,  CKBftei 


’SUM 

OR 

WORSE 


LET'S RETURN TO THE DICE-THROWING EXAMPLE. IF C IS THE EVENT, WHITE
DIE = 1, AND D IS THE EVENT, BLACK DIE = 1, THEN:

[Figure: the 36 outcomes with row C and column D shaded. C OR D is the entire shaded area: one die or the other is 1. C AND D is the overlap of the shaded areas: both dice are 1.]

THIS ILLUSTRATES THE ADDITION RULE. FOR ANY EVENTS E, F:


P(E OR F) = P(E) + P(F) - P(E AND F)

ADDING P(E) + P(F) DOUBLE COUNTS THE ELEMENTARY OUTCOMES SHARED BY
E AND F, SO WE HAVE TO SUBTRACT THE EXTRA AMOUNT, WHICH IS P(E AND F).


SOMETIMES, THE OVERLAP E AND F IS EMPTY, AND THE TWO EVENTS HAVE
NO ELEMENTARY OUTCOMES IN COMMON. IN THAT CASE, WE SAY E AND F ARE
MUTUALLY EXCLUSIVE, MAKING P(E AND F) = 0. HERE WE SEE THE MUTUALLY
EXCLUSIVE EVENTS A, THE DICE ADD TO 3, AND B, THE DICE ADD TO 6.

[Figure: the 36 outcomes with events A and B marked; they do not overlap]

FOR MUTUALLY EXCLUSIVE EVENTS, WE GET A SPECIAL ADDITION RULE: IF E
AND F ARE MUTUALLY EXCLUSIVE, THEN

P(E OR F) = P(E) + P(F)

AND WE CHECK THAT P(A OR B) = 2/36 + 5/36 = P(A) + P(B).

AND FINALLY, A SUBTRACTION RULE: FOR ANY EVENT E,

P(E) = 1 - P(NOT E)

THIS IS USEFUL WHEN P(NOT E) IS EASIER TO COMPUTE THAN P(E). FOR
INSTANCE, LET E BE THE EVENT, A DOUBLE 1 IS NOT THROWN. THE EVENT
NOT E, A DOUBLE 1 IS THROWN, HAS PROBABILITY P(NOT E) = 1/36, SO

P(E) = 1 - P(NOT E) = 1 - 1/36 = 35/36

[Figure: the 36 outcomes with every outcome except double 1 shaded]
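
The addition, special addition, and subtraction rules can be checked by brute force over the 36 outcomes. The sketch below assumes fair dice, so every elementary outcome has probability 1/36; the helper `P` and the set names are hypothetical. Note that `|` and `&` here are Python's set union and intersection (OR and AND), not the conditional-probability bar.

```python
from fractions import Fraction
from itertools import product

# Sample space: (white, black) pairs, all equally likely for fair dice.
space = set(product(range(1, 7), repeat=2))


def P(event):
    """Probability of an event = (number of outcomes in it) / 36."""
    return Fraction(len(event), len(space))


A = {o for o in space if sum(o) == 3}   # dice add to 3
B = {o for o in space if sum(o) == 6}   # dice add to 6
C = {o for o in space if o[0] == 1}     # white die shows 1
D = {o for o in space if o[1] == 1}     # black die shows 1

# Addition rule: subtract the double-counted overlap.
print(P(C | D), "=", P(C) + P(D) - P(C & D))

# Special addition rule: A and B are mutually exclusive (empty overlap).
print(P(A | B), "=", P(A) + P(B))

# Subtraction rule: E = "a double 1 is not thrown".
double_one = {(1, 1)}
E = space - double_one
print(P(E), "=", 1 - P(double_one))
```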


BEFORE THE BLACK DIE. WHAT'S THE PROBABILITY THAT THE FACES SUM TO 3?


NOW SUPPOSE THE
WHITE DIE COMES
UP 1 (EVENT C).
WHAT'S THE
PROBABILITY
OF A NOW?




BEFORE ANY DICE WERE THROWN, THE SAMPLE SPACE HAD 36 OUTCOMES, BUT
NOW THAT THE EVENT C HAS OCCURRED, THE OUTCOME MUST BELONG TO THE
REDUCED SAMPLE SPACE C.

[Figure: the six outcomes in which the white die shows 1]

IN THE REDUCED SAMPLE SPACE OF SIX ELEMENTARY OUTCOMES, ONLY ONE
OUTCOME (1,2) SUMS TO 3, SO THE CONDITIONAL PROBABILITY IS 1/6.


IN GENERAL, TO FIND
THE CONDITIONAL
PROBABILITY P(E|F),
WE LOOK AT THE
EVENT E AND F AS
PART OF THE REDUCED
SAMPLE SPACE F.


WE TRANSLATE THIS
INTO A FORMAL
DEFINITION: THE CONDITIONAL
PROBABILITY OF E, GIVEN F, IS

$$P(E \mid F) = \frac{P(E \text{ AND } F)}{P(F)}$$

FROM WHICH YOU CAN DIRECTLY
VERIFY SOME INTUITIVE FACTS:

P(E|E) = 1 (ONCE E OCCURS, IT'S CERTAIN.)

WHEN E AND F ARE MUTUALLY
EXCLUSIVE,

P(E|F) = 0 (ONCE F HAS OCCURRED, E IS IMPOSSIBLE.)


REARRANGING THE DEFINITION GIVES US THE MULTIPLICATION RULE:

P(E AND F) = P(E|F)P(F)

WHICH WE WOULD LIKE TO REDUCE TO A "SPECIAL" MULTIPLICATION RULE,
UNDER THE FAVORABLE CIRCUMSTANCES THAT P(E|F) = P(E). THAT WOULD BE
EXCELLENT!


AND WHILE YOU'RE
WAITING FOR THE
NEXT PAGE, NOTE THAT
SWAPPING E AND F
PROVES THAT

P(F)P(E|F) = P(E)P(F|E)


INDEPENDENCE AND THE SPECIAL MULTIPLICATION RULE

TWO EVENTS E AND F ARE INDEPENDENT OF EACH OTHER IF THE
OCCURRENCE OF ONE HAS NO INFLUENCE ON THE PROBABILITY OF THE
OTHER. FOR INSTANCE, THE ROLL OF ONE DIE HAS NO EFFECT ON THE ROLL
OF ANOTHER (UNLESS THEY'RE GLUED TOGETHER, MAGNETIC, ETC.).


IN TERMS OF CONDITIONAL PROBABILITY, THIS AMOUNTS TO SAYING
P(E) = P(E|F) OR, EQUIVALENTLY, P(F) = P(F|E). WHEN E AND F ARE
INDEPENDENT, WE GET A SPECIAL MULTIPLICATION RULE:

P(E AND F) = P(E)P(F)


LET'S VERIFY THE INDEPENDENCE OF DICE, USING THE FORMULAS. C IS THE
EVENT WHITE DIE COMES UP 1, D IS THE EVENT BLACK DIE COMES UP 1, AND
WE HAVE:

$$P(C \mid D) = \frac{P(C \text{ AND } D)}{P(D)} = \frac{1/36}{1/6} = \frac{1}{6} = P(C)$$

BUT THE WHITE DIE SHOWING 1 OBVIOUSLY DOES AFFECT THE CHANCES THAT
THE SUM OF THE TWO DICE IS 3:

$$P(A \mid C) = \frac{P(A \text{ AND } C)}{P(C)} = \frac{1/36}{1/6} = \frac{1}{6} \neq P(A) = \frac{1}{18}$$

SO THESE TWO EVENTS ARE NOT INDEPENDENT.
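
Conditional probability and the independence check can be reproduced by counting outcomes. This sketch again assumes fair dice with 36 equally likely outcomes; `P`, `cond`, and `independent` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))   # (white, black)


def P(event):
    return Fraction(len(event), len(space))


def cond(E, F):
    """P(E|F): the share of the reduced sample space F that also lies in E."""
    return P(E & F) / P(F)


def independent(E, F):
    """E and F are independent exactly when P(E AND F) = P(E)P(F)."""
    return P(E & F) == P(E) * P(F)


A = {o for o in space if sum(o) == 3}   # dice add to 3
C = {o for o in space if o[0] == 1}     # white die shows 1
D = {o for o in space if o[1] == 1}     # black die shows 1

print("P(C|D) =", cond(C, D), " P(C) =", P(C), " independent:", independent(C, D))
print("P(A|C) =", cond(A, C), " P(A) =", P(A), " independent:", independent(A, C))
```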




BEFORE GOING ON, LET'S SUMMARIZE ALL THE RULES WE'VE ACCUMULATED:


ADDITION RULE:

P(E OR F) = P(E) + P(F) - P(E AND F)


SPECIAL ADDITION RULE: WHEN E AND F ARE
MUTUALLY EXCLUSIVE,

P(E OR F) = P(E) + P(F)

SUBTRACTION RULE:

P(E) = 1 - P(NOT E)

MULTIPLICATION RULE:

P(E AND F) = P(E|F)P(F)

SPECIAL MULTIPLICATION RULE: WHEN E
AND F ARE INDEPENDENT,

P(E AND F) = P(E)P(F)


AND NOW, DE MERE'S PROBLEM AT LAST... LET E BE THE EVENT OF GETTING
AT LEAST ONE SIX IN FOUR ROLLS OF A SINGLE DIE. WHAT'S P(E)? THIS IS
ONE OF THOSE EVENTS WHOSE NEGATIVE IS EASIER TO DESCRIBE: NOT E IS
THE EVENT OF GETTING NO SIXES IN FOUR THROWS.


IF $A_i$ IS THE EVENT, GETTING NO
SIX ON THE iTH THROW, WE KNOW
THAT $P(A_i) = \frac{5}{6}$. WE ALSO KNOW
THAT ROLLS ARE INDEPENDENT, SO

$$P(\text{NOT } E) = P(A_1 \text{ AND } A_2 \text{ AND } A_3 \text{ AND } A_4) = \left(\frac{5}{6}\right)^4 = .482$$

(BY THE MULTIPLICATION RULE), AND

P(E) = 1 - P(NOT E) = .518


NOW THE SECOND HALF. LET F BE THE EVENT, GETTING AT LEAST ONE
DOUBLE SIX IN 24 THROWS. AGAIN, NOT F IS EASIER TO DESCRIBE: IT'S THE
EVENT OF GETTING NO DOUBLE SIXES.

IF $B_i$ IS THE EVENT, NO DOUBLE
SIX IS THROWN ON THE iTH
ROLL, THEN NOT F = $B_1$ AND $B_2$
AND ... AND $B_{24}$. THE PROBABILITY OF
EACH $B_i$ IS

$$P(B_i) = \frac{35}{36}, \text{ SO}$$

$$P(\text{NOT } F) = \left(\frac{35}{36}\right)^{24} = .509$$

(BY THE MULTIPLICATION RULE),
AND WE CONCLUDE THAT

P(F) = 1 - P(NOT F) = 1 - .509 = .491


DE MERE TOLD PASCAL HE HAD ACTUALLY OBSERVED THAT EVENT F OCCURRED
LESS OFTEN THAN EVENT E, BUT HE WAS AT A LOSS TO EXPLAIN WHY. FROM
WHICH WE CONCLUDE THAT DE MERE GAMBLED OFTEN AND KEPT CAREFUL
RECORDS!!
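
Both of de Mere's gambles follow the same pattern (subtraction rule plus the special multiplication rule for independent rolls), so a single helper covers them. This is a sketch assuming fair dice; the function name `at_least_once` is hypothetical.

```python
from fractions import Fraction


def at_least_once(p_success, n_rolls):
    """P(at least one success in n independent tries) = 1 - P(no successes)."""
    return 1 - (1 - p_success) ** n_rolls


p_e = at_least_once(Fraction(1, 6), 4)     # at least one six in 4 rolls
p_f = at_least_once(Fraction(1, 36), 24)   # at least one double six in 24 rolls

print(f"P(E) = {float(p_e):.3f}")
print(f"P(F) = {float(p_f):.3f}")
print("first gamble wins more often:", p_e > p_f)
```

The first probability sits a little above one half and the second a little below, which matches what the chevalier observed at the tables.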




SUPPOSE A RARE DISEASE INFECTS ONE OUT OF EVERY 1000 PEOPLE IN A
POPULATION...


AND SUPPOSE THAT THERE IS A GOOD, BUT NOT PERFECT, TEST FOR THIS
DISEASE. IF A PERSON HAS THE DISEASE, THE TEST COMES BACK POSITIVE 99%
OF THE TIME. ON THE OTHER HAND, THE TEST ALSO PRODUCES SOME FALSE
POSITIVES. ABOUT 2% OF UNINFECTED PATIENTS ALSO TEST POSITIVE. AND YOU
JUST TESTED POSITIVE. WHAT ARE YOUR CHANCES OF HAVING THE DISEASE?


WE HAVE TWO EVENTS TO WORK WITH:


A: PATIENT HAS THE DISEASE.
B: PATIENT TESTS POSITIVE.

THE INFORMATION ABOUT THE TEST'S
EFFECTIVENESS CAN BE WRITTEN


P(A) = .001        (ONE PATIENT IN 1000 HAS THE DISEASE)

P(B|A) = .99       (PROBABILITY OF A POSITIVE TEST, GIVEN INFECTION, IS .99)

P(B|NOT A) = .02   (PROBABILITY OF A FALSE POSITIVE, GIVEN NO INFECTION, IS .02)

AND WE ASK


P(A|B) = WHAT?     (PROBABILITY OF HAVING THE DISEASE, GIVEN A POSITIVE TEST)


JOE BEGINS WITH A 2X2 TABLE, WHICH DIVIDES THE SAMPLE SPACE INTO FOUR
MUTUALLY EXCLUSIVE EVENTS. IT DISPLAYS EVERY POSSIBLE COMBINATION OF
DISEASE STATE AND TEST RESULT.


          A                NOT A
B         A AND B          NOT A AND B
NOT B     A AND NOT B      NOT A AND NOT B


LET'S FIND THE PROBABILITIES OF EACH EVENT IN THE TABLE:


          A                  NOT A
B         P(A AND B)         P(NOT A AND B)         P(B)
NOT B     P(A AND NOT B)     P(NOT A AND NOT B)     P(NOT B)
          P(A)               P(NOT A)               1

P(A AND B) = P(B|A)P(A) = (.99)(.001) = .00099

P(NOT A AND B) = P(B|NOT A)P(NOT A) = (.02)(.999) = .01998

ALLOWING US TO FILL IN SOME ENTRIES.


WE FIND THE REMAINING PROBABILITIES BY SUBTRACTING IN THE COLUMNS, THEN
ADDING ACROSS THE ROWS.


THE FINAL TABLE IS


          A          NOT A
B         .00099     .01998     .02097     P(B)
NOT B     .00001     .97902     .97903     P(NOT B)
          .001       .999       1
          P(A)       P(NOT A)


FROM WHICH WE DIRECTLY DERIVE

$$P(A \mid B) = \frac{P(A \text{ AND } B)}{P(B)} = \frac{.00099}{.02097} = .0472$$


WHAT'S THE PHYSICIAN TO DO? JOE BAYES ADVISES HER NOT TO START
TREATMENT ON THE BASIS OF THIS TEST ALONE. THE TEST DOES PROVIDE
INFORMATION, HOWEVER: WITH A POSITIVE TEST THE PATIENT'S CHANCE OF
HAVING THE DISEASE INCREASES FROM 1 IN 1000 TO ABOUT 1 IN 21. THE DOCTOR
FOLLOWS UP WITH MORE TESTS.


JOE BAYES COLLECTS HIS CONSULTING CHECK BEFORE ADMITTING THAT ALL
THOSE STEPS HE WENT THROUGH CAN BE COMPRESSED INTO THE SINGLE
FORMULA CALLED BAYES' THEOREM:

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(A)\,P(B \mid A) + P(\text{NOT } A)\,P(B \mid \text{NOT } A)}$$

IT COMPUTES P(A|B) FROM P(A) AND THE TWO CONDITIONAL PROBABILITIES
P(B|A) AND P(B|NOT A). YOU CAN DERIVE IT BY NOTING THAT THE BIG FRACTION
CAN BE EXPRESSED AS

$$\frac{P(A \text{ AND } B)}{P(A \text{ AND } B) + P(\text{NOT } A \text{ AND } B)} = \frac{P(A \text{ AND } B)}{P(B)} = P(A \mid B)$$
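
The table-filling steps and the one-line formula give the same answer, which a short Python sketch can confirm using the three numbers from the disease example (.001, .99, .02). The function and variable names are hypothetical.

```python
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) from P(A), P(B|A) and P(B|NOT A)."""
    a_and_b = p_a * p_b_given_a                  # P(A AND B)
    not_a_and_b = (1 - p_a) * p_b_given_not_a    # P(NOT A AND B)
    return a_and_b / (a_and_b + not_a_and_b)     # divide by P(B)


posterior = bayes(p_a=0.001, p_b_given_a=0.99, p_b_given_not_a=0.02)
print(f"P(disease | positive test) = {posterior:.4f}")
print(f"that is roughly 1 in {1 / posterior:.0f}")
```

Trying a larger prior, say `p_a=0.1`, shows how strongly the answer depends on how rare the disease is, not only on the accuracy of the test.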




♦  Chapter  4  ♦ 

RANDOM  VARIABLES 


IN CHAPTER 2, WE SAW THAT OBSERVATIONS OF NUMERICAL
DATA, LIKE STUDENTS' WEIGHTS, CAN BE GRAPHED AND
SUMMARIZED IN TERMS OF MIDPOINTS, SPREADS, OUTLIERS, ETC.
IN CHAPTER 3, WE SAW HOW PROBABILITIES CAN BE ASSIGNED
TO THE OUTCOMES OF A RANDOM EXPERIMENT.


IF WE IMAGINE A RANDOM EXPERIMENT REPEATED MANY TIMES,
WE EXPECT THAT THE ACTUAL OUTCOMES OVER TIME WILL BE
GOVERNED BY THEIR PROBABILITIES. THE PROBABILITIES FORM A
MODEL FOR REAL-LIFE EXPERIMENTS. SO WHY NOT DO FOR THE
MODEL WHAT WE'VE ALREADY DONE FOR THE DATA IT DESCRIBES?


STUDENT BODY. THAT'S THE RANDOM EXPERIMENT. THE STUDENT'S HEIGHT,
WEIGHT, FAMILY INCOME, SAT SCORE, AND GRADE POINT AVERAGE ARE
ALL NUMERICAL VARIABLES DESCRIBING PROPERTIES OF THE RANDOMLY
SELECTED STUDENT. THEY'RE ALL RANDOM VARIABLES.


ANOTHER EXAMPLE: TOSS TWO COINS (THE RANDOM EXPERIMENT) AND RECORD
THE NUMBER OF HEADS: 0, 1, OR 2.


NOTE THE NOTATION! THE VARIABLE IS WRITTEN WITH A CAPITAL X. THE
LOWERCASE x REPRESENTS A SINGLE VALUE OF X, FOR EXAMPLE x = 2, IF
HEADS COMES UP TWICE.


ANOTHER EXAMPLE IS
BASED ON THE FAMILIAR
TOSS OF TWO DICE. LET
Y REPRESENT THE SUM
OF THE DOTS ON THE
TWO DICE. FOR THIS
RANDOM VARIABLE, y
CAN BE ANY WHOLE NUMBER
FROM 2 TO 12.

[Figure: a roll of two dice with Y = 7]


NOW LET'S DRAW GRAPHS, OR HISTOGRAMS, SHOWING THESE
PROBABILITY DISTRIBUTIONS. FOR EACH VALUE OF x, WE DRAW A BAR
EQUAL IN HEIGHT TO p(x).

[Figure: probability histogram of X, the number of heads in two coin tosses, with bars at x = 0, 1, 2]

IT'S EASY TO SEE THAT THE TOTAL AREA OF THESE BOXES IS 1. EACH BOX HAS
BASE 1 AND HEIGHT p(x), SO THE TOTAL AREA IS THE SUM OF THE
PROBABILITIES OF ALL OUTCOMES, I.E., 1.


HERE'S THE PROBABILITY HISTOGRAM OF THE RANDOM VARIABLE Y, SHOWING
THE PROBABILITY DISTRIBUTION OF THE SUM OF TWO DICE.

[Figure: probability histogram of Y, the sum of two dice]
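
The probability distribution of Y can be tabulated by counting how many of the 36 equally likely outcomes give each sum. The sketch below assumes fair dice and draws a crude text histogram; the names `ways` and `p` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count the outcomes (white, black) that produce each sum y.
ways = Counter(white + black for white, black in product(range(1, 7), repeat=2))

# p(y) = (number of ways to roll y) / 36
p = {y: Fraction(count, 36) for y, count in sorted(ways.items())}

for y, prob in p.items():
    print(f"y = {y:2d}  p(y) = {str(prob):>4}  {'#' * ways[y]}")

# Each bar has base 1 and height p(y), so the total area is the sum of the p(y).
print("total area =", sum(p.values()))
```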




WHY DO WE CALL THESE GRAPHS HISTOGRAMS? YOU'LL RECALL THAT IN
CHAPTER 2, A HISTOGRAM WAS A GRAPH THAT DISPLAYED HOW MANY DATA
POINTS LAY IN EACH OF A SERIES OF INTERVALS:

[Figure: frequency histogram of the weights; vertical axis 0 to 25; horizontal axis: Weight in Pounds]

FROM THIS FREQUENCY HISTOGRAM, WE DERIVED THE RELATIVE FREQUENCY
HISTOGRAM, SHOWING THE PROPORTION OF DATA IN EACH INTERVAL:

[Figure: relative frequency histogram of the weights; vertical axis 0.0 to 0.2; horizontal axis: Weight in Pounds, 100 to 200]


BUT YOU'LL RECALL THAT, BY
ONE DEFINITION, PROBABILITY
IS THE RELATIVE FREQUENCY
OF AN EVENT "IN THE
LONG RUN." IF WE REPEAT
THE RANDOM EXPERIMENT
MANY TIMES, THE RELATIVE
FREQUENCY HISTOGRAM OF
THE OUTCOMES SHOULD COME
TO LOOK VERY MUCH LIKE
THE RANDOM VARIABLE'S
PROBABILITY HISTOGRAM!


WE KNOW X'S PROBABILITY DISTRIBUTION, AND WE ALSO KNOW THAT THE
ACTUAL COIN FLIPS WILL MATCH THE PROBABILITIES APPROXIMATELY. AFTER
1000 TOSSES, THE MAD TOSSER TALLIES HER DATA:


PROBABILITY MODEL          OBSERVED DATA
p(x)           x           n_x (NUMBER OF OCCURRENCES)     n_x/n (RELATIVE FREQUENCY)
.25            0           260                             .260
.5             1           517                             .517
.25            2           223                             .223


AND WE SEE THAT THE PROBABILITY HISTOGRAM OF X LOOKS LIKE THE "PURE
FORM" OR MODEL OF THE RELATIVE FREQUENCY HISTOGRAM OF THE DATA.


TO EXTEND THE ANALOGY BETWEEN RELATIVE FREQUENCY AND DATA, WE
SHOULD NOW BE WILLING TO TALK ABOUT THE MEAN AND VARIANCE (OR
STANDARD DEVIATION) OF A PROBABILITY DISTRIBUTION...


AND JUST TO REMIND
OURSELVES THAT WE'RE IN
THE REALM OF THE
ABSTRACT, WE BREAK OUT
SOME GREEK LETTERS...


THE SAMPLE MEAN WAS DEFINED
BY THE EQUATION

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

GOOD! NOW LET'S
TWIST IT AROUND!


NOW SOME OF THESE DATA POINTS $x_i$ MAY WELL HAVE EQUAL VALUES. THINK
OF THE MAD COIN TOSSER: THE ONLY AVAILABLE VALUES WERE 0, 1, AND 2, AND
SHE MADE 1000 TOSSES. THE VALUE 0 WAS TAKEN ON 260 TIMES, 1 HEAD CAME
UP 517 TIMES, AND 2 HEADS, 223 TIMES.


AS WE LET x RANGE OVER
ALL VALUES OF X, CALL $n_x$
THE NUMBER OF DATA
POINTS WITH THE VALUE x.
THEN WE CAN REWRITE
THAT FORMULA AS

$$\bar{x} = \frac{1}{n}\sum_{\text{all } x} x\,n_x = \sum_{\text{all } x} x\,\frac{n_x}{n}$$

BECAUSE EACH x IS COUNTED $n_x$ TIMES.

AH! BUT NOW $n_x/n$ IS THE RELATIVE FREQUENCY, THE APPROXIMATE
PROBABILITY, THE NUMBER THAT APPROACHES p(x). SO, BY ANALOGY, WE WRITE

$$\sum_{\text{all } x} x\,p(x)$$

AND DEFINE THAT AS THE
MEAN OF THE PROBABILITY
DISTRIBUTION.


DEFINITION: THE
MEAN OF THE
RANDOM VARIABLE X IS
DEFINED AS

$$\mu = \sum_{\text{all } x} x\,p(x)$$

THIS IS ALSO CALLED THE EXPECTED VALUE OF X, OR E[X]. THINK OF IT AS
THE SUM OF THE POSSIBLE VALUES, EACH WEIGHTED BY ITS PROBABILITY.


THE MAD COIN TOSSER'S EXPERIMENT ALLOWS US TO COMPARE HER SAMPLE
MEAN $\bar{x}$ WITH OUR MODEL MEAN $\mu$:


SAMPLE: $\bar{x} = \frac{0(260) + 1(517) + 2(223)}{1000} = .963$

MODEL: $\mu = 0(.25) + 1(.5) + 2(.25) = 1$
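
The comparison between the sample mean and the model mean can be reproduced from the tally of 1000 tosses given earlier (260, 517, and 223 occurrences of 0, 1, and 2 heads). The sketch below uses hypothetical names `tally` and `model`, and the same weighted-sum formula for both.

```python
# Observed data: how many times each value of X came up in 1000 double tosses.
tally = {0: 260, 1: 517, 2: 223}
n = sum(tally.values())

# Probability model for the number of heads in two fair coin tosses.
model = {0: 0.25, 1: 0.5, 2: 0.25}

# Sample mean: each value weighted by its relative frequency n_x / n.
sample_mean = sum(x * n_x / n for x, n_x in tally.items())

# Model mean (expected value): each value weighted by its probability p(x).
mu = sum(x * p_x for x, p_x in model.items())

print(f"sample mean = {sample_mean:.3f}, model mean = {mu}")
```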


TO SUM UP: $\mu$ AND $\sigma$, THE POPULATION MEAN AND STANDARD DEVIATION, ARE
PROPERTIES WE CAN COMPUTE FROM PROBABILITY DISTRIBUTIONS. THEY ARE
COMPLETELY ANALOGOUS TO THE SAMPLE MEAN $\bar{x}$ AND STANDARD DEVIATION $s$
COMPUTED FROM SAMPLE DATA.


OUR EXAMPLES SO FAR HAVE BEEN DISCRETE RANDOM VARIABLES. THEIR
OUTCOMES ARE A SET OF ISOLATED ("DISCRETE") VALUES, LIKE THOSE WE SAW
IN CHAPTER 3, BUT THERE ARE ALSO

Continuous Random Variables

LET'S IMAGINE A RANDOM EXPERIMENT
IN WHICH ALL OUTCOMES HAVE
PROBABILITY ZERO. THAT'S RIGHT,
p(x) = 0 FOR EVERY x.


HOW CAN WE DRAW A PICTURE OF THIS?
BY ANALOGY WITH THE CASE OF
DISCRETE PROBABILITIES, WE TRY TO
SEE CONTINUOUS PROBABILITIES AS
AREAS UNDER SOMETHING. FOR THE
SPINNING POINTER, THE "SOMETHING"
LOOKS LIKE THIS:

[Figure: probability density for the spinning pointer]

THE PROBABILITY OF AN EXACT
OUTCOME, HOWEVER, IS THE "AREA"
OVER A POINT, WHICH IS ZERO.
(AND NOTE THAT THE TOTAL AREA
UNDER THE CURVE IS EXACTLY 1.)


IN GENERAL, THE PROBABILITY
DENSITY WON'T BE SO SIMPLE,
AND COMPUTING THE AREAS CAN
BE FAR FROM TRIVIAL.


WE HAVE TO USE CALCULUS
NOTATION TO DESCRIBE THE
AREA UNDER THE CURVE f(x):

$$\int_a^b f(x)\,dx$$

THIS SYMBOL IS READ "THE
INTEGRAL OF f FROM a TO b."


NOW LET'S PLAY A SIMPLE GAMBLING GAME. YOU ANTE UP $6.00 TO PLAY, I
FLIP A COIN, YOU WIN $10 IF THE COIN COMES UP HEADS, ZERO IF TAILS. THEN
YOUR WINNINGS W ARE

W = 10X - 6

A NEW RANDOM VARIABLE!
WHAT ARE ITS MEAN AND
VARIANCE?




IN GENERAL IT IS NOT HARD
TO SHOW THAT

$$E[aX + b] = aE[X] + b$$

WHEN a AND b ARE ANY
NUMBERS AND X IS ANY RANDOM
VARIABLE. FOR THE VARIANCE,
THERE'S ALSO A GENERAL
RESULT:

$$\sigma^2(aX + b) = a^2\sigma^2(X)$$
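Both rules can be verified numerically on a small distribution. The sketch below assumes a coin-flip variable X (1 for heads, 0 for tails) and an illustrative game with a payout of 10 and an ante of 6; the function names are invented for this example.

```python
def mean(dist):
    """E[X] for a discrete distribution {x: p(x)}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """sigma^2(X): the expected squared deviation from the mean."""
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, a, b):
    """Distribution of aX + b (assumes a != 0, so values stay distinct)."""
    return {a * x + b: p for x, p in dist.items()}

coin = {0: 0.5, 1: 0.5}              # X = 1 for heads, 0 for tails
winnings = transform(coin, 10, -6)   # illustrative: W = 10X - 6

print(mean(winnings), variance(winnings))  # -1.0 25.0
```

The additive constant shifts the mean but leaves the variance alone, while the multiplier enters the variance squared.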


YOU CAN ALSO ADD TWO RANDOM VARIABLES TOGETHER. FOR INSTANCE, SUPPOSE
WE TOSS A COIN TWICE. THE NUMBER OF HEADS ON BOTH TOSSES IS
$X_1 + X_2$, WHERE $X_1$ AND $X_2$ ARE THE RANDOM VARIABLES GIVING THE RESULTS
OF THE FIRST AND SECOND TOSSES.

$x_1 + x_2$       0     1     2
$p(x_1 + x_2)$   .25   .5    .25

AGAIN, IT'S EASY TO SEE THAT

$$E[X_1 + X_2] = E[X_1] + E[X_2]$$


THE VARIANCE OF THE SUM OF RANDOM VARIABLES HAS A SIMPLE FORM IN
THE SPECIAL CASE WHEN THE VARIABLES X AND Y ARE INDEPENDENT:

$$\sigma^2(X + Y) = \sigma^2(X) + \sigma^2(Y)$$

THE TECHNICAL DEFINITION OF INDEPENDENCE IS BASED ON THE PROBABILITY
PROPERTY P(A AND B) = P(A)P(B), BUT FOR US, INDEPENDENCE JUST MEANS
THAT X AND Y ARE GENERATED BY INDEPENDENT MECHANISMS, SUCH AS


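ALL OF THIS CAN BE GENERALIZED TO THE SUM OF MANY RANDOM VARIABLES:

$$E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n]$$

AND, WHEN THE $X_i$ ARE ALL INDEPENDENT,

$$\sigma^2(X_1 + X_2 + \cdots + X_n) = \sigma^2(X_1) + \sigma^2(X_2) + \cdots + \sigma^2(X_n)$$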
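As a rough check of the addition rules, the sketch below builds the exact distribution of the sum of two independent variables by multiplying probabilities, then compares means and variances. The two-dice example and all function names are assumptions made for illustration.

```python
from itertools import product

def mean(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def sum_of_independent(dist_x, dist_y):
    """Distribution of X + Y, using P(A and B) = P(A)P(B) for independent X, Y."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[x + y] = out.get(x + y, 0.0) + px * py
    return out

die = {face: 1 / 6 for face in range(1, 7)}   # one fair die
two_dice = sum_of_independent(die, die)       # total of two independent dice

print(mean(two_dice), 2 * mean(die))
print(variance(two_dice), 2 * variance(die))
```

The means add for any pair of variables; the variances add here only because the two dice were combined as independent.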


IN THE NEXT CHAPTER, WE WILL SEE TWO IMPORTANT EXAMPLES OF RANDOM
VARIABLES. ONE, THE BINOMIAL, IS THE SUM OF MANY REPEATED INDEPENDENT
RANDOM VARIABLES. THE OTHER, THE NORMAL, IS A CONTINUOUS RANDOM
VARIABLE THAT HAS A SURPRISING RELATIONSHIP TO THE BINOMIAL, AND ANY
OTHER SUM OF INDEPENDENT RANDOM VARIABLES AS WELL.


♦  Chapter  5  ♦ 

A  TALE  OF  TWO 
DISTRIBUTIONS 


NOW WE LOOK AT TWO IMPORTANT EXAMPLES OF
RANDOM VARIABLES, ONE DISCRETE AND ONE CONTINUOUS.


WE BEGIN WITH THE DISCRETE ONE, CALLED THE BINOMIAL RANDOM VARIABLE.
SUPPOSE WE HAVE A RANDOM PROCESS WITH JUST TWO POSSIBLE OUTCOMES:
A HEADS-OR-TAILS COIN TOSS, A WIN-OR-LOSE FOOTBALL GAME, A PASS-OR-FAIL
AUTOMOTIVE SMOG INSPECTION. WE ARBITRARILY CALL ONE OF THESE
OUTCOMES A SUCCESS AND THE OTHER A FAILURE.


"CONGRATULATIONS ON YOUR SUCCESS! YOUR CAR JUST PASSED THE SMOG TEST!"


1) THE RESULT OF EACH TRIAL
MAY BE EITHER A SUCCESS OR
A FAILURE.

2) THE PROBABILITY p OF
SUCCESS IS THE SAME IN
EVERY TRIAL.

3) THE TRIALS ARE INDEPENDENT:
THE OUTCOME OF ONE TRIAL HAS
NO INFLUENCE ON LATER OUTCOMES.


STARTING WITH A BERNOULLI TRIAL, WITH PROBABILITY OF SUCCESS p, LET'S
BUILD A NEW RANDOM VARIABLE BY REPEATING THE BERNOULLI TRIAL.

The binomial random variable
X IS THE NUMBER OF
SUCCESSES IN n REPEATED
BERNOULLI TRIALS WITH
PROBABILITY p OF SUCCESS.

FOUR TIMES IN A ROW. SUCCESS MEANS ROLLING A 6. THE DISTRIBUTION IS:




IN GENERAL, WHAT'S THE PROBABILITY DISTRIBUTION OF THE
BINOMIAL FOR ANY PROBABILITY p AND NUMBER OF TRIALS n? A
PROBABILITY CALCULATION GIVES THE ANSWER: THE PROBABILITY
OF OBTAINING k SUCCESSES IN n TRIALS, Pr(X = k), IS

$$\Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

HERE $\binom{n}{k}$, READ "n CHOOSE k," IS THE BINOMIAL COEFFICIENT. IT COUNTS
ALL POSSIBLE WAYS OF GETTING k SUCCESSES IN n TRIALS. EACH INDIVIDUAL
SEQUENCE OF k SUCCESSES AND n-k FAILURES HAS PROBABILITY $p^k(1-p)^{n-k}$,
BY THE MULTIPLICATION RULE. THERE ARE $\binom{n}{k}$ OF THESE SEQUENCES.
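The formula translates directly into code. The following is a minimal sketch using Python's `math.comb`; the function name `binomial_pmf` is made up here, and the example assumes four rolls of a fair die with "success" meaning a six.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Pr(X = k): number of sequences times the probability of each one."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Illustrative case: four rolls of a die, success = rolling a 6 (p = 1/6).
n, p = 4, 1 / 6
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(distribution):
    print(k, round(prob, 4))
print("at least one six:", round(1 - distribution[0], 4))
```

The probabilities for k = 0 through n add up to 1, which is a quick sanity check on any binomial table.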


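THE FORMULA FOR $\binom{n}{k}$ IS

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

WHERE

$$n! = n \times (n-1) \times \cdots \times 2 \times 1$$

AND 0! IS TAKEN TO BE 1. FOR INSTANCE,
$\binom{4}{2}$, THE NUMBER OF POSSIBLE WAYS TO
CHOOSE TWO LETTERS FROM A SET OF
FOUR LETTERS, IS

$$\binom{4}{2} = \frac{4!}{2!\,2!} = 6$$

NAMELY

AB  AC  AD
BC  BD  CD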


ANOTHER VIEW OF THE BINOMIAL COEFFICIENTS IS IN PASCAL'S TRIANGLE.
EACH ENTRY IS THE SUM OF THE TWO NUMBERS JUST ABOVE IT.

                                 1
                               1   1
                             1   2   1
                           1   3   3   1
                         1   4   6   4   1
                       1   5  10  10   5   1
                     1   6  15  20  15   6   1
                   1   7  21  35  35  21   7   1
                 1   8  28  56  70  56  28   8   1
               1   9  36  84 126 126  84  36   9   1
             1  10  45 120 210 252 210 120  45  10   1
           1  11  55 165 330 462 462 330 165  55  11   1
         1  12  66 220 495 792 924 792 495 220  66  12   1

TO FIND $\binom{n}{k}$, JUST COUNT DOWN TO ROW n AND OVER TO ENTRY k
(REMEMBERING ALWAYS TO START COUNTING FROM ZERO).
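The "sum of the two numbers just above" rule is easy to sketch in code, and the result can be compared against the factorial formula. The function name `pascal_rows` is an assumption for this example.

```python
from math import comb

def pascal_rows(last_row):
    """Build rows 0..last_row; each inner entry is the sum of the two above it."""
    row = [1]
    rows = [row]
    for _ in range(last_row):
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
        rows.append(row)
    return rows

rows = pascal_rows(12)
print(rows[4][2])  # row 4, entry 2 (counting from zero): 6

# Every entry should agree with n! / (k! (n-k)!).
agrees = all(rows[n][k] == comb(n, k) for n in range(13) for k in range(n + 1))
print(agrees)
```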


WHEN p = .5, THE BINOMIAL'S
PROBABILITY DISTRIBUTION IS
PERFECTLY SYMMETRICAL. FOR
6 COIN FLIPS, FOR INSTANCE, IT'S

k = #HEADS    0      1      2      3      4      5      6
Pr(X = k)    1/64   6/64  15/64  20/64  15/64   6/64   1/64


FOR DE MÉRÉ'S ROLL OF FOUR DICE, THE DISTRIBUTION IS MORE LOPSIDED.

THE MEAN AND VARIANCE OF THE
BINOMIAL DISTRIBUTION ARE

$$\mu = np \qquad \sigma^2 = np(1-p)$$

NOTE THAT THE MEAN MAKES
INTUITIVE SENSE: IN n BERNOULLI
TRIALS, THE EXPECTED NUMBER OF
SUCCESSES SHOULD BE np. THE
VARIANCE FOLLOWS FROM THE
FACT THAT THE BINOMIAL IS THE
SUM OF n INDEPENDENT BERNOULLI
TRIALS OF VARIANCE p(1-p).
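One way to convince yourself of the mean and variance formulas is to compute both directly from the probability distribution and compare. This sketch assumes n = 10 and p = .25 purely as an example; the helper names are invented.

```python
from math import comb

def binomial_distribution(n, p):
    """Return {k: Pr(X = k)} for k = 0..n."""
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

def mean(dist):
    return sum(x * q for x, q in dist.items())

def variance(dist):
    mu = mean(dist)
    return sum((x - mu) ** 2 * q for x, q in dist.items())

n, p = 10, 0.25                      # illustrative parameters
dist = binomial_distribution(n, p)

print(mean(dist), n * p)                 # both close to 2.5
print(variance(dist), n * p * (1 - p))   # both close to 1.875
```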


THE PARAMETERS OF THE BINOMIAL DISTRIBUTION ARE n AND p. THE
DISTRIBUTION, MEAN, AND VARIANCE DEPEND ONLY ON THESE TWO NUMBERS.
TABLES OF THE BINOMIAL DISTRIBUTION APPEAR IN MOST TEXTBOOKS AND
COMPUTER PROGRAMS. HERE IS A TABLE FOR n = 10.

VALUES OF Pr(X = k)

  p \ k     0      1      2      3      4      5      6      7      8      9     10
  .1      0.349  0.387  0.194  0.057  0.011  0.001  0.000  0.000  0.000  0.000  0.000
  .25     0.056  0.188  0.282  0.250  0.146  0.058  0.016  0.003  0.000  0.000  0.000
  .50     0.001  0.010  0.044  0.117  0.205  0.246  0.205  0.117  0.044  0.010  0.001
  .75     0.000  0.000  0.000  0.003  0.016  0.058  0.146  0.250  0.282  0.188  0.056
  .9      0.000  0.000  0.000  0.000  0.000  0.001  0.011  0.057  0.194  0.387  0.349


DEPLOYING A NEWLY
INVENTED WEAPON, THE
CALCULUS, DE MOIVRE
SHOWED THAT WHEN p = .5,
THE BINOMIAL DISTRIBUTION
WAS CLOSELY
APPROXIMATED BY A
CONTINUOUS DENSITY
FUNCTION WHICH COULD BE
DESCRIBED VERY SIMPLY.


TO SEE HOW THIS WORKS, IMAGINE THE BINOMIAL DISTRIBUTION WITH p = .5
AND n VERY LARGE: A MILLION, SAY.

SQUASH THE CURVE ALONG THE x AXIS
UNTIL THE STANDARD DEVIATION
BECOMES 1, WHILE STRETCHING IT
ALONG THE y AXIS TO KEEP THE AREA
UNDER IT EQUAL TO 1.


THE RESULT IS VERY CLOSE TO A SMOOTH, SYMMETRICAL, BELL-SHAPED
CURVE, WHICH DE MOIVRE SHOWED WAS GIVEN BY THE SIMPLE FORMULA:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

THIS FUNCTION IS CALLED THE standard normal distribution.

(e IS A USEFUL MATHEMATICAL CONSTANT APPROXIMATELY EQUAL TO 2.718.)

(CONVINCE YOURSELF THAT THIS FUNCTION REALLY HAS A BELL-SHAPED
GRAPH: FOR z FAR FROM ZERO, f(z) IS VERY NEARLY ZERO, SINCE IT HAS A BIG
DENOMINATOR. IT'S SYMMETRICAL, SINCE f(z) = f(-z). AND IT HAS A
MAXIMUM AT z = 0.)
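The bell-shape claims can be checked numerically. The sketch below assumes the usual form of the standard normal density, 1/sqrt(2*pi) times e^(-z^2/2), and uses a simple trapezoid rule (a helper written for this example) to approximate the total area.

```python
import math

def f(z):
    """Standard normal density (assumed form: exp(-z^2/2) / sqrt(2*pi))."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def trapezoid_area(func, a, b, steps=10_000):
    """Approximate the area under func between a and b."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for i in range(1, steps):
        total += func(a + i * h)
    return total * h

area = trapezoid_area(f, -6.0, 6.0)
print(round(area, 6))         # very close to 1
print(f(1.3) == f(-1.3))      # symmetry: f(z) = f(-z)
print(f(0.0) > f(0.1))        # the peak is at z = 0
print(f(5.0))                 # nearly zero far from the center
```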

THE DISTRIBUTION IS CALLED THE
STANDARD NORMAL BECAUSE ALL
THAT SQUASHING AND STRETCHING
WAS SPECIALLY ARRANGED TO GIVE
IT THESE SIMPLE PROPERTIES,
WHICH WE PRESENT WITHOUT
PROOF:

$$\mu = 0 \qquad \sigma = 1$$


TO SUMMARIZE DE MOIVRE:
IF YOU "NORMALIZE" THE
BINOMIAL DISTRIBUTION
WITH p = 1/2 (I.E., CENTER
IT ON ZERO AND MAKE ITS
STANDARD DEVIATION = 1),
THEN IT CLOSELY FITS
THE STANDARD NORMAL
DISTRIBUTION

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$


OTHER NORMALS, WITH DIFFERENT MEANS AND VARIANCES, ARE OBTAINED BY
STRETCHING AND SLIDING THE STANDARD NORMAL. IN GENERAL, WE WRITE THE
FORMULA

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

THIS GIVES A SYMMETRIC, BELL-SHAPED DISTRIBUTION CENTERED ON THE
MEAN $\mu$ WITH THE STANDARD DEVIATION $\sigma$.


HERE ARE TWO DIFFERENT NORMALS WITH THE REGIONS WITHIN THEIR
STANDARD DEVIATIONS SHADED.


DE MOIVRE PROVED THAT THE STANDARD NORMAL FITS THE (NORMALIZED)
BINOMIAL WHEN p = .5.

BUT IT TURNS OUT THAT AS n GETS LARGE, THE BINOMIAL'S ASYMMETRY IS
OVERWHELMED, AS YOU SEE IN THIS EXAMPLE:

[FIGURE: TWO HISTOGRAMS. Binomial, n = 2 and p = 0.3; Binomial, n = 20 and p = 0.3]


THEN ALL WE NEED TO FIND PROBABILITIES FOR ANY NORMAL DISTRIBUTION IS
THE SINGLE TABLE FOR THE STANDARD NORMAL F(z).

z     -2.5   -2.4   -2.3   -2.2   -2.1   -2.0   -1.9   -1.8   -1.7   -1.6
F(z)  0.006  0.008  0.011  0.014  0.018  0.023  0.029  0.036  0.045  0.055

z     -1.5   -1.4   -1.3   -1.2   -1.1   -1.0   -0.9   -0.8   -0.7   -0.6
F(z)  0.067  0.081  0.097  0.115  0.136  0.159  0.184  0.212  0.242  0.274

z     -0.5   -0.4   -0.3   -0.2   -0.1    0.0    0.1    0.2    0.3    0.4
F(z)  0.309  0.345  0.382  0.421  0.460  0.500  0.540  0.579  0.618  0.655

z      0.5    0.6    0.7    0.8    0.9    1.0    1.1    1.2    1.3    1.4
F(z)  0.691  0.726  0.758  0.788  0.816  0.841  0.864  0.885  0.903  0.919

z      1.5    1.6    1.7    1.8    1.9    2.0    2.1    2.2    2.3    2.4
F(z)  0.933  0.945  0.955  0.964  0.971  0.977  0.982  0.986  0.989  0.992

z      2.5
F(z)  0.994


HERE $F(a) = \Pr(z \le a)$, THE AREA UNDER THE DENSITY CURVE TO THE LEFT
OF z = a.

(WE CAN ALSO GRAPH THE CURVE y = F(z), THE CUMULATIVE PROBABILITY.
IT LOOKS LIKE THIS.)

[FIGURE: GRAPH OF THE CUMULATIVE PROBABILITY y = F(z)]


THE GENERAL RULE FOR COMPUTING NORMAL PROBABILITIES IS THEREFORE

$$\Pr(a \le X \le b) = F\!\left(\frac{b-\mu}{\sigma}\right) - F\!\left(\frac{a-\mu}{\sigma}\right)$$
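In place of a printed table, the cumulative probability F(z) can be computed from the error function in Python's standard library. This is a sketch under that assumption; the names `F` and `normal_prob`, and the example with mean 100 and standard deviation 15, are illustrative only.

```python
import math

def F(z):
    """Cumulative standard normal probability Pr(Z <= z), via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_prob(a, b, mu, sigma):
    """Pr(a <= X <= b) for a normal X: standardize both ends, then subtract."""
    return F((b - mu) / sigma) - F((a - mu) / sigma)

print(round(F(1.0), 3))    # 0.841
print(round(F(-1.9), 3))   # 0.029

# Illustrative normal with mean 100 and standard deviation 15.
print(round(normal_prob(85, 115, 100, 15), 4))
```

Standardizing first means a single function (or a single table) serves every normal distribution.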



NOW BACK TO DE MOIVRE
AND HIS BINOMIAL
APPROXIMATION. LET'S
LOOK AT A BINOMIAL
DISTRIBUTION WITH n = 25
TRIALS AND p = .5 (25
COIN FLIPS, SAY). WE CAN
COMPUTE (OR LOOK UP IN
A TABLE) ANY PROBABILITY,
FOR EXAMPLE, $\Pr(X \le 14)$.
IT IS .7878 EXACTLY.

NOW CALCULATE A NORMAL RANDOM VARIABLE $X^*$ WITH THE SAME MEAN
$\mu = np = (25)(.5) = 12.5$ AND STANDARD DEVIATION $\sigma = \sqrt{np(1-p)} = 2.5$.
THEN

$$\Pr(X^* \le 14) = \Pr\!\left(z \le \frac{14 - 12.5}{2.5}\right) = \Pr(z \le .6) = .7257$$

".7878 VERSUS .7257? WHAT KIND OF APPROXIMATION..."
"UM, AN APPROXIMATE..."

AH, BUT WE CAN DO BETTER!
IF YOU LOOK CLOSELY AT THE
FIRST HISTOGRAM, YOU SEE
THE BARS ARE CENTERED ON
THE NUMBERS. THIS MEANS
$\Pr(X \le 14)$ IS ACTUALLY THE
AREA UNDER THE BARS LESS
THAN x = 14.5. WE NEED TO
ACCOUNT FOR THAT EXTRA .5,
AND IN FACT,

$$\Pr(X^* \le 14.5) = \Pr(z \le .8) = .7881$$

A VERY GOOD APPROXIMATION
TO .7878 INDEED!
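The comparison between the exact binomial probability and its normal approximation, with and without the extra .5, can be reproduced in a short script. This is a sketch: the function names are invented, and F(z) is computed from `math.erf` rather than read from a table.

```python
import math
from math import comb

def F(z):
    """Cumulative standard normal probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_cdf(k, n, p):
    """Exact Pr(X <= k) for a binomial with n trials and success probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

n, p, k = 25, 0.5, 14
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

exact = binomial_cdf(k, n, p)
plain = F((k - mu) / sigma)            # no correction: z = .6
corrected = F((k + 0.5 - mu) / sigma)  # continuity correction: z = .8

print(round(exact, 4), round(plain, 4), round(corrected, 4))
# 0.7878 0.7257 0.7881
```

Adding .5 accounts for the half-width of the histogram bar centered on 14, which is why the corrected value lands so close to the exact one.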


RULE OF THUMB IS WHENEVER n IS BIG ENOUGH TO MAKE THE NUMBER OF
EXPECTED SUCCESSES AND FAILURES BOTH GREATER THAN FIVE:

$$np \ge 5 \quad\text{and}\quad n(1-p) \ge 5$$

YOU CAN SEE FROM THESE HISTOGRAMS THAT THE FIT WHEN p = 0.1 IS
MEDIOCRE OR WORSE UNTIL n REACHES 50, MAKING np = 5.

[FIGURE: HISTOGRAMS OF BINOMIALS WITH p = 0.1 FOR INCREASING n, INCLUDING n = 10 AND n = 50]


♦ Chapter 6 ♦

SAMPLING

BY NOW, AFTER A STEADY DIET OF COINS, DICE, AND ABSTRACT
IDEAS, YOU MAY BE WONDERING WHAT ALL THIS STATISTICAL
EQUIPMENT WE'VE BEEN BUILDING HAS TO DO WITH THE REAL
WORLD. WELL, NOW WE'RE FINALLY GOING TO FIND OUT.


IN THIS CHAPTER, WE BEGIN LOOKING AT THE REAL BUSINESS OF STATISTICS,
WHICH IS, AFTER ALL, TO SAVE PEOPLE TIME AND MONEY. PEOPLE HATE TO
WASTE TIME DOING UNNECESSARY WORK, AND ONE THING STATISTICS CAN DO
IS TELL US EXACTLY HOW LAZY WE CAN AFFORD TO BE.


THE PROBLEM WITH THE WORLD IS THAT THE COLLECTIONS OF STUFF IN IT
ARE SO LARGE, IT'S HARD TO GET THE INFORMATION WE WANT:


OUR METHOD IS TO TAKE
a SAMPLE... a
RELATIVELY SMALL
SUBSET OF THE TOTAL
POPULATION, THE WAY
POLLSTERS DO AT
ELECTION TIME.


THE SIMPLE RANDOM SAMPLE


SUPPOSE WE HAVE A LARGE
POPULATION OF OBJECTS AND A
PROCEDURE FOR SELECTING n OF
THEM. IF THE PROCEDURE
ENSURES THAT ALL POSSIBLE
SAMPLES OF n OBJECTS ARE
EQUALLY LIKELY, THEN WE CALL
THE PROCEDURE A simple random sample.


THE SIMPLE RANDOM SAMPLE HAS TWO PROPERTIES THAT MAKE IT THE
STANDARD AGAINST WHICH WE MEASURE ALL OTHER METHODS:


UNBIASED: EACH UNIT HAS THE SAME
CHANCE OF BEING CHOSEN.


INDEPENDENCE: SELECTION OF ONE
UNIT HAS NO INFLUENCE ON THE
SELECTION OF OTHER UNITS.
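In software, a simple random sample is usually drawn from a list of all units (a sampling frame). The sketch below uses `random.sample` from Python's standard library, which picks n distinct items so that every possible subset of size n is equally likely; the frame of 1,000 labeled units and the function name are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, all subsets equally likely."""
    rng = random.Random(seed)      # seed only to make the example repeatable
    return rng.sample(frame, n)

# Hypothetical sampling frame: a list of every unit in the population.
frame = [f"unit-{i}" for i in range(1, 1001)]
sample = simple_random_sample(frame, 10, seed=42)
print(sample)
```

The hard part in practice is rarely the random draw itself; it is building a complete and accurate frame to draw from.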


BUT THIS IS NOT ALWAYS EASY. MAKING THE FRAME MAY BE PROHIBITIVELY
COSTLY, CONTROVERSIAL, OR EVEN IMPOSSIBLE. FOR EXAMPLE, AN E.P.A. WATER
QUALITY STUDY NEEDED A SAMPLING FRAME OF LAKES IN THE U.S., SO THEN
SOMEBODY HAS TO DECIDE


ARE THERE OTHER WAYS TO SAMPLE THAT ARE MORE EFFICIENT AND
COST-EFFECTIVE THAN A SIMPLE RANDOM SAMPLE? YES, IF YOU ALREADY KNOW
SOMETHING ABOUT THE POPULATION. FOR INSTANCE:


CLUSTER SAMPLING GROUPS THE POPULATION INTO SMALL
CLUSTERS, DRAWS A SIMPLE RANDOM SAMPLE OF
CLUSTERS, AND OBSERVES EVERYTHING IN THE SAMPLED CLUSTERS. THIS CAN BE
COST-EFFECTIVE IF TRAVEL COSTS BETWEEN RANDOMLY SAMPLED UNITS ARE HIGH.


AN EXAMPLE IS A CITY
HOUSING SURVEY WHICH
DIVIDES A CITY INTO
BLOCKS, RANDOMLY
SAMPLES THE BLOCKS,
AND LOOKS AT EVERY
HOUSING UNIT IN EACH
SAMPLED BLOCK.


SYSTEMATIC SAMPLING STARTS WITH A RANDOMLY
CHOSEN UNIT AND THEN SELECTS EVERY kth
UNIT THEREAFTER. FOR INSTANCE, A HIGHWAY TRAFFIC STUDY MIGHT CHECK
EVERY HUNDREDTH CAR AT A TOLL BOOTH. THIS PLAN IS EASY TO IMPLEMENT
AND CAN BE MORE EFFICIENT IF TRAFFIC PATTERNS VARY SMOOTHLY OVER TIME.


"EXCUSE ME... WOULD YOU
MIND ANSWERING FIFTY
OR SIXTY QUESTIONS?"


Word of warning #1:

MOST STATISTICAL METHODS DEPEND ON
THE INDEPENDENCE AND LACK OF BIAS OF
THE SIMPLE RANDOM SAMPLE. THE RESULTS
AHEAD APPLY TO THE SIMPLE RANDOM
SAMPLE ONLY. FOR OTHER SAMPLING
PROCEDURES, THE RESULTS MUST BE
MODIFIED. THE DETAILS APPEAR IN
SPECIALIZED SAMPLING TEXTBOOKS AND
COMPUTER ALGORITHMS.


A COMMONLY USED METHOD IS ESPECIALLY PRONE TO BIAS. IT'S CALLED AN
opportunity sample, AVOIDING ALL


QUESTIONNAIRES WENT TO WOMEN'S ORGANIZATIONS (AN OPPORTUNITY
SAMPLE). ONLY 4.9% WERE FILLED OUT AND RETURNED (RESPONSE BIAS).

SO HER "RESULTS" WERE BASED ON A SAMPLE OF WOMEN WHO WERE HIGHLY
MOTIVATED TO ANSWER THE SURVEY'S QUESTIONS, FOR WHATEVER REASON.


THE ASTUTE READER WILL RECOGNIZE THIS AS A BERNOULLI SYSTEM. EACH
NEW TACK IS THE OUTCOME OF A BERNOULLI TRIAL WITH SOME PROBABILITY p
OF SUCCESS (I.E., BEING DEFECT-FREE) AND PROBABILITY 1-p OF FAILURE
(I.E., BEING DEFECTIVE).


WE THINK OF THIS SITUATION AS IF THERE WERE A HIDDEN BUT REAL
"BERNOULLI MACHINE" WHOSE PROBABILITY p GOVERNS THE OUTCOMES WE
OBSERVE IN THE SO-CALLED "REAL WORLD."


AND WE ANSWER WITH
ANOTHER QUESTION: WHAT
DOES THE FIRST
QUESTION MEAN?


IN FACT, THESE $\hat{p}$ VALUES ARE LOOKING MORE AND MORE LIKE A RANDOM
VARIABLE: THE SELECTION OF THE n-UNIT SAMPLE IS A RANDOM EXPERIMENT,
AND THE OBSERVATION $\hat{p}$ IS A NUMERICAL OUTCOME!


TO BE PRECISE, IF X IS
THE NUMBER OF
SUCCESSES IN THE SAMPLE,
THEN X IS NOTHING BUT
OUR OLD FRIEND THE
BINOMIAL RANDOM
VARIABLE (n TRIALS,
PROBABILITY p)... AND WE
DEFINE THE OBSERVED
PROPORTION TO BE THE
RANDOM VARIABLE

$$\hat{P} = \frac{X}{n}$$

(BIG $\hat{P}$ IS THE
RANDOM VARIABLE,
LITTLE $\hat{p}$ ITS
VALUE FOR A PARTICULAR
SAMPLE.)


KNOWING ALL ABOUT X, WE QUICKLY DEDUCE A FEW FACTS ABOUT $\hat{P}$:


1) THE MEAN OF $\hat{P}$ IS $E[\hat{P}] = p$.
2) THE STANDARD DEVIATION OF $\hat{P}$ IS $\sigma(\hat{P}) = \sqrt{\frac{p(1-p)}{n}}$.
3) FOR LARGE n, $\hat{P}$ IS APPROXIMATELY NORMAL.


AND THERE YOU HAVE IT. ALL THE OBSERVED VALUES OF $\hat{P}$ WILL BE CENTERED
ON p (NOT SURPRISINGLY), AND THEIR STANDARD DEVIATION, OR SPREAD, IS
PROPORTIONAL TO THAT MAGIC NUMBER WE MENTIONED AT THE BEGINNING OF
THE CHAPTER: $1/\sqrt{n}$.


AND, SINCE $\hat{P}$ IS NEARLY NORMAL, WE CAN USE OUR RULE OF THUMB TO
CONCLUDE THAT APPROXIMATELY 68% OF ALL ESTIMATES WILL FALL WITHIN ONE
STANDARD DEVIATION OF THE TRUE VALUE p.


THE STANDARD DEVIATION OF $\hat{P}$ IS A MEASURE
OF THE sampling error.

AS WE'VE SEEN, FOR THE BINOMIAL $\hat{P}$, THIS
SAMPLING ERROR IS INVERSELY PROPORTIONAL
TO $\sqrt{n}$: INCREASING THE SAMPLE SIZE BY A
FACTOR OF 4 REDUCES THE SPREAD $\sigma(\hat{P})$ BY A
FACTOR OF 2.


SAMPLE SIZES FOR TACKS, p = .85

n                    1      4      16     25     100    10,000
$\sqrt{n}$           1      2      4      5      10     100
$\sigma(\hat{P})$   .357   .1785  .089   .071   .036   .0036
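A table of sampling errors like this one follows directly from the formula sqrt(p(1-p)/n). The sketch below recomputes it, assuming p = .85 and the sample sizes 1, 4, 16, 25, 100, and 10,000; the function name is made up for the example.

```python
import math

def sd_of_proportion(p, n):
    """Standard deviation of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.85  # assumed true proportion for this illustration
for n in (1, 4, 16, 25, 100, 10_000):
    print(n, math.sqrt(n), round(sd_of_proportion(p, n), 4))

# Quadrupling the sample size halves the spread.
ratio = sd_of_proportion(p, 4) / sd_of_proportion(p, 16)
print(round(ratio, 6))  # 2.0
```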

LINGUISTIC NOTE: AN ESTIMATE IS A SINGLE MEASURE OR OBSERVATION. AN
ESTIMATOR IS A RULE FOR GETTING ESTIMATES. IN THIS CASE, THE ESTIMATOR
IS THE RANDOM VARIABLE $\hat{P} = X/n$.


MOST OF STATISTICS INVOLVES THE 4-STEP PROCESS WE'VE JUST WALKED
THROUGH:


DEFINE POPULATION WITH UNKNOWN PARAMETER.

FIND AN ESTIMATOR, ITS THEORETICAL SAMPLING DISTRIBUTION AND
STANDARD DEVIATION.


IF $\mu$ IS THE (UNKNOWN)
MEAN PICKLE LENGTH, AND
$\sigma$ IS THE STANDARD
DEVIATION OF THE PICKLE
LENGTH DISTRIBUTION,
THEN

$$E[X_i] = \mu \qquad \sigma(X_i) = \sigma$$

FOR EVERY i (BECAUSE $X_i$
COULD HAVE BEEN THE
LENGTH OF ANY PICKLE).


"STRANGE, HOW MUCH WE KNOW ABOUT...
WE DIDN'T EVEN KNOW THEY
WERE RANDOM VARIABLES
A MINUTE AGO."


AS BEFORE, WE'D LIKE TO KNOW "HOW CLOSE" THIS IS TO $\mu$, MEANING, IF
THIS SAMPLING WERE DONE MANY TIMES, WHAT'S THE DISTRIBUTION OF $\bar{X}$?
BECAUSE WE KNOW ABOUT $X_1$, $X_2$, ... AND $X_n$, WE ALSO KNOW THAT

$$E[\bar{X}] = \mu \qquad \sigma(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

BUT WE DON'T KNOW THE SHAPE OF $\bar{X}$'S DISTRIBUTION. THE PROBABILITY
DISTRIBUTION OF THE SAMPLE PROPORTION $\hat{P}$ WAS ALMOST NORMAL, BECAUSE IT
WAS BASED ON A BINOMIAL RANDOM VARIABLE. BUT WHAT ABOUT $\bar{X}$, THE
SAMPLE MEAN ESTIMATOR???


IT TURNS OUT THAT $\bar{X}$ IS ALSO APPROXIMATELY NORMAL! THIS FAMOUS
RESULT IS CALLED THE

CENTRAL LIMIT
THEOREM

IT SAYS: IF ONE TAKES RANDOM SAMPLES
OF SIZE n FROM A POPULATION OF MEAN
$\mu$ AND STANDARD DEVIATION $\sigma$, THEN, AS
n GETS LARGE, $\bar{X}$ APPROACHES THE
NORMAL DISTRIBUTION WITH MEAN $\mu$
AND STANDARD DEVIATION $\sigma/\sqrt{n}$. THEN

$$\Pr(a \le \bar{X} \le b) \approx \Pr\!\left(\frac{a-\mu}{\sigma/\sqrt{n}} \le z \le \frac{b-\mu}{\sigma/\sqrt{n}}\right)$$


WHAT IS REMARKABLE ABOUT THIS? IT SAYS THAT REGARDLESS OF THE SHAPE
OF THE ORIGINAL DISTRIBUTION (IN THIS CASE, OF PICKLE LENGTHS), THE
TAKING OF AVERAGES RESULTS IN A NORMAL. TO FIND THE DISTRIBUTION OF
$\bar{X}$, WE NEED KNOW ONLY THE POPULATION MEAN AND STANDARD DEVIATION.
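A small simulation gives a feel for this. The sketch below assumes a deliberately lopsided population (exponential, with mean 1 and standard deviation 1), draws many samples of size 30, and looks at how the sample means behave; the sample size, replicate count, and seed are arbitrary choices for illustration.

```python
import random
import statistics

def sample_means(draw, n, replicates, rng):
    """Return the mean of each of `replicates` samples of size n."""
    return [statistics.fmean(draw(rng) for _ in range(n))
            for _ in range(replicates)]

rng = random.Random(0)                               # fixed seed, repeatable run
draw_exponential = lambda r: r.expovariate(1.0)      # skewed: mean 1, sd 1
n = 30
means = sample_means(draw_exponential, n, 5000, rng)

center = statistics.fmean(means)
spread = statistics.stdev(means)
predicted_spread = 1.0 / n ** 0.5                    # sigma / sqrt(n)
within_one_sd = sum(abs(m - 1.0) <= predicted_spread for m in means) / len(means)

print(round(center, 3), round(spread, 3), round(predicted_spread, 3))
print(round(within_one_sd, 3))   # roughly 0.68 if the means are nearly normal
```

Even though individual draws are far from bell-shaped, the averages cluster around the population mean with spread close to sigma divided by the square root of n.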


THE THREE PROBABILITY DENSITIES ABOVE ALL HAVE THE SAME MEAN AND
STANDARD DEVIATION, DESPITE THEIR DIFFERENT SHAPES. WHEN n = 10, THE
SAMPLING DISTRIBUTIONS OF THE MEAN, $\bar{X}$, ARE NEARLY IDENTICAL.


MAKING THE ASSUMPTION THAT THE
ORIGINAL POPULATION DISTRIBUTION
WAS NORMAL, OR NEARLY NORMAL,
"STUDENT" WAS ABLE TO CONCLUDE


t IS MORE SPREAD OUT THAN z. IT'S
"FLATTER" THAN NORMAL. THIS IS
BECAUSE THE USE OF s INTRODUCES
MORE UNCERTAINTY, MAKING t
"SLOPPIER" THAN z.


THE AMOUNT OF SPREAD DEPENDS ON
THE SAMPLE SIZE. THE GREATER THE
SAMPLE SIZE, THE MORE CONFIDENT WE
CAN BE THAT s IS NEAR $\sigma$, AND THE
CLOSER t GETS TO z, THE NORMAL.


GOSSET WAS ABLE TO COMPUTE
TABLES OF t FOR VARIOUS SAMPLE
SIZES, WHICH WE WILL SEE HOW TO
USE IN THE FOLLOWING CHAPTER.

"IN THE MEANTIME, JUST THINK OF WHAT YOU'VE LEARNED!"


IN THIS CHAPTER, WE CONSIDERED A CENTRAL PROBLEM OF REAL-WORLD
STATISTICS: HOW TO SELECT A SAMPLE FROM A LARGE POPULATION SO THAT
STATISTICAL ANALYSIS CAN BE VALID. BESIDES THE "GOLD STANDARD" OF THE
SIMPLE RANDOM SAMPLE, WE ALSO DESCRIBED SOME OTHER SAMPLING SCHEMES.


WE FOUND THAT SAMPLE
PROPORTIONS $\hat{p}$ WERE
APPROXIMATELY NORMALLY
DISTRIBUTED, WHILE THE
DISTRIBUTION OF THE
SAMPLE MEAN $\bar{X}$ DEPENDED
ON THE SAMPLE SIZE. FOR
LARGE SAMPLES, THE
DISTRIBUTION WAS
APPROXIMATELY NORMAL,
WHILE FOR SMALL SAMPLES,
WE USE THE STUDENT'S t
DISTRIBUTION.


IN THE NEXT TWO CHAPTERS, WE LOOK
AT HOW TO USE THESE DISTRIBUTIONS TO
MAKE STATISTICAL INFERENCES: GIVEN A
SINGLE OBSERVATION, LIKE A POLITICAL
POLL, HOW DO WE USE OUR KNOWLEDGE
OF $\hat{p}$ AND $\bar{X}$ TO EVALUATE IT?


IN THE LAST CHAPTER WE
LOOKED AT SAMPLING.
STARTING WITH A LARGE
POPULATION, WE IMAGINED
TAKING MANY SAMPLES, AND
WE DEDUCED HOW SOME
SAMPLE ESTIMATORS WERE
DISTRIBUTED.


"IT'S LIKE A CRIMINAL
INVESTIGATION, WATSON."


IN DEDUCTIVE REASONING, WE REASON
FROM A HYPOTHESIS TO A CONCLUSION:
"IF LORD FASTBACK COMMITTED MURDER,
THEN HE WOULD WIPE THE FINGERPRINTS
OFF THE GUN."


INDUCTIVE REASONING, BY
CONTRAST, ARGUES BACKWARD
FROM A SET OF OBSERVATIONS
TO A REASONABLE HYPOTHESIS.


IN MANY WAYS, SCIENCE, INCLUDING STATISTICS, IS LIKE DETECTIVE WORK.
BEGINNING WITH A SET OF OBSERVATIONS, WE ASK WHAT CAN BE SAID ABOUT
THE SYSTEMS THAT GENERATED THEM.


AFTER CENSORING THE REMARKS OF A FEW GRUMPY OUTLIERS, HOLMES FINDS
THAT 550 VOTERS FAVOR HIS CLIENT, SENATOR ASTUTE.


AFTER ASTUTE CALMS
DOWN, HOLMES EXPLAINS
WHAT HE MEANS BY 95%
CONFIDENCE: HE KNOWS
THAT HIS ESTIMATION
PROCEDURE HAS A 95%
PROBABILITY OF
PRODUCING AN INTERVAL
CONTAINING p, I.E., IN HIS
MANY YEARS OF POLLING,
p HAS FALLEN WITHIN THE
CONFIDENCE INTERVAL
AROUND THE OBSERVED
VALUE, $\hat{p}$, 95% OF THE
TIME.


HOLMES NOW TRANSLATES
THE ARCHERY LESSON INTO
THE LANGUAGE WE
DEVELOPED LAST CHAPTER.


NOW WE DO SOME ALGEBRA. BY
DEFINITION OF THE z TRANSFORM,

$$.95 = \Pr\!\left(-1.96 \le \frac{\hat{p} - p}{\sigma(\hat{p})} \le 1.96\right)$$

$$.95 = \Pr\big(p - 1.96\,\sigma(\hat{p}) \le \hat{p} \le p + 1.96\,\sigma(\hat{p})\big)$$

WHICH IS JUST ANOTHER WAY OF SAYING THAT 95% OF THE $\hat{p}$ "ARROWS" LAND
BETWEEN $p - 1.96\,\sigma(\hat{p})$ AND $p + 1.96\,\sigma(\hat{p})$.


NOW WE'RE IN A POSITION TO VIEW THE TARGET FROM BEHIND! ONE MORE
TURN OF THE ALGEBRA CRANK MAKES IT

$$.95 = \Pr\big(\hat{p} - 1.96\,\sigma(\hat{p}) \le p \le \hat{p} + 1.96\,\sigma(\hat{p})\big)$$


NOW THE FORMULA IS

$$.95 = \Pr\big(\hat{p} - 1.96\,SE(\hat{p}) \le p \le \hat{p} + 1.96\,SE(\hat{p})\big)$$

AGAIN, THIS EQUATION DESCRIBES THE
PROBABILITY THAT THE TRUE, FIXED
POPULATION PROPORTION FALLS
WITHIN THE RANDOM INTERVAL

$$\big(\hat{p} - 1.96\,SE(\hat{p}),\ \hat{p} + 1.96\,SE(\hat{p})\big)$$

IF WE SAMPLED REPEATEDLY, THESE
INTERVALS WOULD COVER p 95% OF
THE TIME.


NOW OUR PROBABILITY CALCULATION IS DONE, AND IT'S TIME FOR...


HE MAKES USE OF STEP ONE TO

HE CONCLUDES THAT WE CAN HAVE 95% CONFIDENCE THAT p LIES IN
THE RANGE

$$\hat{p} \pm 1.96\,SE(\hat{p}) = .550 \pm (1.96)(.0157) = .550 \pm .031$$

THIS IS WHAT POLLS MEAN
WHEN THEY REFER TO THEIR
"MARGIN OF ERROR." IN THIS
CASE, HOLMES FOUND THAT
$.519 \le p \le .581$,

IN OTHER WORDS THAT
p = 55% WITH A 3% MARGIN OF
ERROR (POLLS TYPICALLY USE A
95% CONFIDENCE LEVEL).
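The margin-of-error arithmetic is short enough to script. The sketch below assumes a poll in which 550 of 1,000 respondents favor the candidate and uses the standard error sqrt(p-hat(1 - p-hat)/n); the function name and return layout are choices made for this example.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample confidence interval for a proportion: p_hat +/- z * SE."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    margin = z * se
    return p_hat - margin, p_hat + margin, margin

low, high, margin = proportion_ci(550, 1000)   # assumed poll: 550 of 1000
print(round(low, 3), round(high, 3), round(margin, 3))
# 0.519 0.581 0.031
```

Passing a different `z` (for example 2.58) gives the interval for another confidence level without changing anything else.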



THIS PAGE SHOWS THE RESULTS OF A COMPUTER SIMULATION OF TWENTY
SAMPLES OF SIZE n = 1000. WE ASSUMED THAT THE TRUE VALUE OF p = .5. AT
THE TOP YOU SEE THE SAMPLING DISTRIBUTION OF $\hat{p}$ (NORMAL, WITH MEAN p
AND STANDARD DEVIATION $\sigma(\hat{p})$), AND BELOW ARE THE 95% CONFIDENCE
INTERVALS FROM EACH SAMPLE. ON AVERAGE, ONE OUT OF TWENTY (OR 5%) OF
THESE INTERVALS WILL NOT COVER THE POINT p = .5.
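A simulation along these lines is easy to rerun. The sketch below is not the book's program; it assumes p = .5 and n = 1000, repeats the poll 2,000 times with a fixed seed, and counts how often the 95% interval covers the true p.

```python
import math
import random

def interval_covers(p_true, n, rng, z=1.96):
    """Simulate one poll of size n and report whether its interval covers p_true."""
    successes = sum(rng.random() < p_true for _ in range(n))
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se <= p_true <= p_hat + z * se

rng = random.Random(1)          # fixed seed so the run is repeatable
replicates = 2000
hits = sum(interval_covers(0.5, 1000, rng) for _ in range(replicates))
coverage = hits / replicates
print(coverage)                 # expected to be near 0.95
```

With only twenty samples, seeing zero, one, or two misses would all be unremarkable; the 5% figure describes the long-run average.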




ALTHOUGH 95%
CONFIDENCE IS
GOOD ENOUGH FOR
NEWSPAPER POLLS,
IT ISN'T GOOD
ENOUGH FOR
SENATOR ASTUTE.
HE WANTS 99%!


HOW TO INCREASE CONFIDENCE? USING
THE ARCHERY TARGET, WE CAN SEE TWO
WAYS: ONE IS TO INCREASE THE SIZE
OF THE CIRCLE YOU DRAW,


AND ANOTHER WOULD BE TO IMPROVE
THE AIM OF THE ARCHER IN THE FIRST
PLACE, SO HER ARROWS LAND CLOSER
TO THE BULL'S-EYE.


THE FIRST METHOD IS EQUIVALENT TO WIDENING THE CONFIDENCE INTERVAL.
THE GREATER THE MARGIN OF ERROR, THE MORE CERTAIN YOU ARE THE TRUE
VALUE OF p LIES IN THE INTERVAL.


MAYBE IT'S TIME TO SEE EXACTLY
HOW WE FIND THE ENDS OF
THESE CONFIDENCE INTERVALS.


THE RELEVANT NUMBER
HERE WE USUALLY CALL $\alpha$.
IT MEASURES THE
DIFFERENCE BETWEEN THE
DESIRED CONFIDENCE
LEVEL AND CERTAINTY. FOR
EXAMPLE, WHEN THE
CONFIDENCE LEVEL IS 95%,
OR 0.95, $\alpha$ IS .05. SO WE
SPEAK OF THE $(1-\alpha)\cdot 100\%$
CONFIDENCE LEVEL.


WE CAN FIND $z_{\alpha/2}$ STRAIGHT
FROM THE STANDARD NORMAL
TABLE (PAGE 84). IT'S THE
POINT WITH THE PROPERTY
$\Pr(z \ge z_{\alpha/2}) = \alpha/2$.


z      -2.5   -2.4   -2.3
F(z)   0.006  0.008  0.011

z      -2.0   -1.9   -1.8
F(z)   0.023  0.029  0.036


$$\Pr(z \ge z_{.025}) = .025$$
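Critical values can also be computed rather than read from a table. This sketch uses `statistics.NormalDist` from Python's standard library (available from Python 3.8); the helper name `z_critical` is an assumption for the example.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{alpha/2}: the point with upper-tail area alpha/2."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

print(round(z_critical(0.95), 2))  # 1.96
print(round(z_critical(0.99), 2))  # 2.58
```

By symmetry, the lower cutoff is just the negative of the value returned, which is why the table excerpt shows F(z) near .025 between z = -2.0 and z = -1.9.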


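TO MAKE A 99% CONFIDENCE INTERVAL, WE USE THAT TABLE TO WRITE

$$.99 = \Pr\big(\hat{p} - 2.58\,SE(\hat{p}) \le p \le \hat{p} + 2.58\,SE(\hat{p})\big)$$

WHICH WE SLOPPILY ABBREVIATE AS

$$p = \hat{p} \pm 2.58\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = .55 \pm (2.58)(.0157) = .55 \pm .041$$

WITH 99% CONFIDENCE.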


SO THEY DO THE POLL,
AND GO INTO THE
ELECTION WITH
CONFIDENCE.


WHAT HAPPENED IS THAT POLITICIANS ARE NOT ELECTED BY POLLS!


SOME PROBLEMS WITH POLLS, AS OPPOSED TO ELECTIONS:


THERE IS NO WAY FOR A
POLLSTER TO GET INSIDE
A POTENTIAL VOTER'S
HEAD AND KNOW IF SHE'S
GOING TO VOTE, IF SHE'S
LYING, OR IF SHE'S GOING
TO CHANGE HER MIND
BEFORE ELECTION DAY.
LARGE SAMPLE SIZES
CANNOT REDUCE THESE
KINDS OF ERRORS.


SINCE THESE ERRORS CAN BE
LARGE, IT SELDOM PAYS TO TAKE
A VERY LARGE RANDOM SAMPLE.


IN THE LAST FIVE PRESIDENTIAL ELECTIONS, THE GALLUP POLL HAS
INTERVIEWED FEWER THAN 4,000 VOTERS FOR EACH ELECTION. YET IN ALL FIVE
ELECTIONS, THE GALLUP ORGANIZATION'S ERRORS IN PREDICTING THE
PRESIDENTIAL ELECTION OUTCOME HAVE BEEN LESS THAN 2%.


THEIR SUCCESS IS DUE TO THEIR USE OF ESTIMATORS THAT ACCOUNT FOR
NONRESPONSE, AND THEY SCREEN OUT ELIGIBLE VOTERS WHO ARE NOT
LIKELY TO VOTE.

"WHAT ABOUT THESE?"
"OUTSIDE OF TEXAS AND CHICAGO, I THINK WE'RE SAFE."

TO SUMMARIZE, ESTIMATED PROPORTION = TRUE PROPORTION +
BIAS + RANDOM SAMPLING ERROR.

EVEN POLLSTERS HAVE LIMITED
FUNDS. THEY WISELY CHOOSE TO
SPEND THEIR MONEY REDUCING
BIAS, RATHER THAN INCREASING THE
SAMPLES BEYOND 4,000 VOTERS.


Confidence Intervals
for $\mu$


UP TO NOW, WE'VE BEEN
LOOKING AT CONFIDENCE
INTERVALS FOR A PROPORTION
p OF A POPULATION.
EXACTLY THE SAME
REASONING WORKS FOR
THE POPULATION MEAN $\mu$.


IN THE LAST CHAPTER (P. 109), WE SAW THAT THE DISTRIBUTION OF SAMPLE
MEANS $\bar{X}$ IS APPROXIMATELY NORMAL, CENTERED ON THE ACTUAL POPULATION
MEAN $\mu$, WITH STANDARD DEVIATION $\sigma/\sqrt{n}$, WHERE $\sigma$ IS THE POPULATION
STANDARD DEVIATION. SO, FOR LARGE n,


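$$.95 = \Pr(-1.96 \le z \le 1.96) = \Pr\!\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right)$$

AGAIN, NOT KNOWING $\sigma$, WE REPLACE $\sigma$
WITH s, THE SAMPLE STANDARD DEVIATION:

$$.95 \approx \Pr\!\left(-1.96 \le \frac{\bar{X} - \mu}{s/\sqrt{n}} \le 1.96\right)$$

"TURNING THE SAME ALGEBRA CRANK AS..."

THE TERM $s/\sqrt{n}$ IS CALLED THE SAMPLE STANDARD ERROR, AND WRITTEN
$SE(\bar{X})$. WE CONCLUDE THAT

$$.95 \approx \Pr\big(\bar{X} - 1.96\,SE(\bar{X}) \le \mu \le \bar{X} + 1.96\,SE(\bar{X})\big)$$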


JUST AS BEFORE, WE HAVE
FOUND THAT THE RANDOM
INTERVAL

$$\bar{X} \pm 1.96\,SE(\bar{X})$$

COVERS THE TRUE MEAN, $\mu$, WITH
PROBABILITY .95. SO NOW WE CAN
CALL IN SHERLOCK HOLMES TO
MAKE A STATISTICAL INFERENCE
BASED ON A SINGLE SAMPLE OF
SIZE n WITH MEAN $\bar{x}$.


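THE $(1-\alpha)\cdot 100\%$ CONFIDENCE INTERVAL IS


POPULATION MEAN, $\mu$

$$\mu = \bar{x} \pm z_{\alpha/2}\,SE(\bar{x}) \quad\text{WHERE}\quad SE(\bar{x}) = \frac{s}{\sqrt{n}}$$


POPULATION PROPORTION, p

$$p = \hat{p} \pm z_{\alpha/2}\,SE(\hat{p}) \quad\text{WHERE}\quad SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$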


Student's t (again!)

AS WE SAW IN CHAPTER 6, THE STATISTIC t

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

HAS AN APPROXIMATELY NORMAL
DISTRIBUTION ONLY WHEN IT IS
COMPUTED USING A LARGE SAMPLE.
FOR SMALL SAMPLES (n = 5, 10, 15...),
THIS IS NO LONGER THE CASE, AND
WE HAVE TO USE THE STUDENT'S t.


LET'S LOOK AT t A LITTLE MORE CLOSELY. WE MENTIONED THAT THE t
DISTRIBUTION IS MORE SPREAD OUT THAN THE NORMAL, AND THAT THE
AMOUNT OF SPREAD DEPENDS ON THE SAMPLE SIZE.

[FIGURE: t DENSITY CURVES COMPARED WITH THE NORMAL; THE CURVE LABELED "t, SMALLER SAMPLE" IS MORE SPREAD OUT]


WHAT ITS DISCOVERER
GOSSET DID WAS TO
QUANTIFY THIS
RELATIONSHIP. IF n IS
THE SAMPLE SIZE, HE
SAID, THEN CALL n-1
THE NUMBER OF
degrees of freedom.


(THE GENERAL IDEA: GIVEN n
PIECES OF DATA $x_1, x_2, \ldots, x_n$,
YOU USE UP ONE "DEGREE
OF FREEDOM" WHEN YOU
COMPUTE $\bar{x}$, LEAVING n-1
INDEPENDENT PIECES OF
INFORMATION.)


GOSSET COMPUTED TABLES OF
THE t DISTRIBUTION FOR
DIFFERENT SAMPLE SIZES, I.E.,
DEGREES OF FREEDOM. WE
REPEAT, THE MORE DEGREES OF
FREEDOM, THE CLOSER t
BECOMES TO THE STANDARD
NORMAL.


FOR A $(1-\alpha)\cdot 100\%$ CONFIDENCE INTERVAL, WE FIND THE CRITICAL VALUE
$t_{\alpha/2}$ SUCH THAT $\Pr(t \ge t_{\alpha/2}) = \alpha/2$. HERE IS A SHORT TABLE OF
CRITICAL VALUES FOR THE t DISTRIBUTION.

$1-\alpha$            .80     .90     .95     .99
$\alpha$              .20     .10     .05     .01
$\alpha/2$            .10     .05     .025    .005

DEGREES OF
FREEDOM
      1              3.08    6.31   12.71   63.66
     10              1.37    1.81    2.23    3.17
     30              1.31    1.70    2.04    2.75
    100              1.29    1.66    1.98    2.63
  $\infty$           1.28    1.645   1.96    2.58


EACH COLUMN REPRESENTS A FIXED LEVEL OF CONFIDENCE, WITH INCREASING
NUMBERS OF DEGREES OF FREEDOM. THE HIGHER THE DEGREES OF FREEDOM,
THE CLOSER THE CRITICAL VALUE GETS TO $z_{\alpha/2}$, THE CRITICAL VALUE OF THE
NORMAL DISTRIBUTION.


WE DERIVE THE WIDTH OF OUR
CONFIDENCE INTERVAL DIRECTLY
FROM THE DEFINITION OF t:

$$t = \frac{\bar{X} - \mu}{SE(\bar{X})}$$

(NOTE: IT'S EXACTLY LIKE
THE CASE OF A LARGE SAMPLE,
BUT WITH t INSTEAD OF z.)


THEN, FOR CONFIDENCE LEVEL
$(1-\alpha)\cdot 100\%$,

$$(1-\alpha) = \Pr\big(\bar{X} - t_{\alpha/2}\,SE(\bar{X}) \le \mu \le \bar{X} + t_{\alpha/2}\,SE(\bar{X})\big)$$

FROM WHICH WE INFER: GIVEN A
SINGLE SAMPLE OF SIZE n AND
MEAN $\bar{x}$, WE CAN BE $(1-\alpha)\cdot 100\%$
CONFIDENT THAT THE POPULATION
MEAN $\mu$ FALLS IN THE RANGE

$$\mu = \bar{x} \pm t_{\alpha/2}\,SE(\bar{x})$$

WHERE $SE(\bar{x}) = s/\sqrt{n}$ AND $t_{\alpha/2}$ IS THE
CRITICAL VALUE OF THE t DISTRIBUTION
WITH n-1 DEGREES OF FREEDOM.


NOTE: THE DERIVATION OF
THE t DISTRIBUTION DEPENDED ON
THE ASSUMPTION THAT THE SAMPLE
WAS FROM A NORMAL POPULATION. IN
PRACTICE, CONFIDENCE INTERVALS
BASED ON THE t WORK REASONABLY
WELL, EVEN WHEN THE POPULATION
DISTRIBUTION IS ONLY APPROXIMATELY
MOUND-SHAPED.


EXAMPLE: SUPPOSE CHAMELEON MOTORS HAS TO CRASH TEST


SO WHERE CAN WE PLACE THE MEAN WITH 95% CONFIDENCE? WE FIND OUR
CRITICAL VALUE $t_{.025}$, WITH 4 DEGREES OF FREEDOM:

$1-\alpha$            .80     .90     .95     .99
$\alpha$              .20     .10     .05     .01
$\alpha/2$            .10     .05     .025    .005

DEGREES OF
FREEDOM
      1              3.08    6.31   12.71   63.66
      2              1.89    2.92    4.30    9.92
      3              1.64    2.35    3.18    5.84
      4              1.53    2.13    2.78    4.60
      5              1.48    2.01    2.57    4.03


$t_{.025} = 2.78$. PLUG IT IN:

$$\mu = \bar{x} \pm 2.78\,\frac{s}{\sqrt{n}} = 540 \pm 2.78\,\frac{s}{\sqrt{5}} = 540 \pm 372$$

SO THE BEST WE CAN SAY WITH 95% CONFIDENCE IS THAT THE AVERAGE
DAMAGE WILL LIE BETWEEN $168 AND $912.
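The same recipe can be sketched in code for any small sample. The five repair costs below are hypothetical values, chosen only so that the sample mean is 540 and the margin comes out near 372; they are not taken from the text. The critical values are the 95% column for 1 to 5 degrees of freedom from a standard t table.

```python
import math
import statistics

# 95% critical values t_{.025}, keyed by degrees of freedom (standard t table).
T_CRITICAL_95 = {1: 12.71, 2: 4.30, 3: 3.18, 4: 2.78, 5: 2.57}

def t_interval_95(data):
    """Return (sample mean, margin) for a 95% small-sample confidence interval."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)            # sample standard deviation (n - 1 divisor)
    se = s / math.sqrt(n)                 # standard error of the mean
    margin = T_CRITICAL_95[n - 1] * se    # n - 1 degrees of freedom
    return x_bar, margin

costs = [150, 400, 720, 500, 930]         # hypothetical repair costs, in dollars
x_bar, margin = t_interval_95(costs)
print(x_bar, round(margin))               # 540 372
print(round(x_bar - margin), round(x_bar + margin))
```

With only five observations the t critical value (2.78) is much larger than the normal's 1.96, so the interval is correspondingly wide.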


TO COMPUTE THIS CONFIDENCE INTERVAL USING STUDENT'S t WE HAVE MADE
AN UNSTATED ASSUMPTION: WE ASSUMED THAT CRASH REPAIR COSTS ARE
APPROXIMATELY NORMALLY DISTRIBUTED, I.E., IF WE CRASHED 1000
CHAMELEONS, THE HISTOGRAM OF REPAIR COSTS WOULD BE SYMMETRICAL AND
MOUND-SHAPED. WE CANNOT KNOW THIS FROM 5 DATA POINTS ALONE, BUT
MAYBE YEARS OF EXPERIENCE WITH EARLIER MODELS PROVIDE NORMALLY
DISTRIBUTED COST HISTOGRAMS FOR FRONT END REPAIRS: INFORMATION WHICH
WOULD TEND TO SUPPORT OUR USE OF STUDENT'S t.




THE STANDARD ERROR


\(z_{\alpha/2}\,SE(\hat{p})\)     \(z_{\alpha/2}\,SE(\bar{x})\)     \(t_{\alpha/2}\,SE(\bar{x})\)


AND EACH OF THOSE STANDARD ERRORS IS PROPORTIONAL TO THAT MAGIC
NUMBER:




♦ Chapter 8 ♦

HYPOTHESIS TESTING


NOW WE ENTER A NEW AREA... GOVERNMENT, BUSINESS, AND THE HARD AND
SOFT SCIENCES ALL USE AND OFTEN ABUSE THESE TESTS OF SIGNIFICANCE.
IT'S ALL ABOUT ANSWERING THE QUESTION, "COULD THESE OBSERVATIONS
REALLY HAVE OCCURRED BY CHANCE?"




WE BEGIN WITH AN EXAMPLE FROM THE LAW: A COMPOSITE OF SEVERAL CASES
ARGUED IN THE SOUTH BETWEEN 1960 AND 1980, IN WHICH EXPERT
WITNESSES PRESENTED THE CASE FOR RACIAL BIAS IN JURY SELECTION.


(PURE COINCIDENCE!)


PANELS OF JURORS ARE THEORETICALLY DRAWN AT RANDOM FROM A LIST OF
ELIGIBLE CITIZENS. HOWEVER, IN SOUTHERN STATES IN THE '50S AND '60S, FEW
AFRICAN AMERICANS WERE FOUND ON JURY PANELS, SO SOME DEFENDANTS
CHALLENGED THE VERDICTS. ON APPEAL, AN EXPERT STATISTICAL WITNESS GAVE
THIS EVIDENCE:


50% OF ELIGIBLE CITIZENS WERE AFRICAN AMERICAN.

ON AN 80-PERSON PANEL OF POTENTIAL JURORS, ONLY FOUR WERE
AFRICAN AMERICANS.




FOR THE SAKE OF ARGUMENT, SUPPOSE THAT THE SELECTION OF POTENTIAL
JURORS WAS RANDOM. THEN THE NUMBER OF AFRICAN AMERICANS ON THE
80-PERSON PANEL WOULD BE THE BINOMIAL RANDOM VARIABLE X WITH
n = 80 TRIALS AND p = .5.


LET'S FOLLOW THE PROCESS AGAIN TO SORT OUT THE FOUR FORMAL STEPS OF
STATISTICAL HYPOTHESIS TESTING.


Step 1. FORMULATE ALL HYPOTHESES.

H0, THE NULL HYPOTHESIS, IS USUALLY THAT THE OBSERVATIONS ARE THE
RESULT PURELY OF CHANCE.

Ha, THE ALTERNATE HYPOTHESIS, IS THAT THERE IS A REAL EFFECT, THAT THE
OBSERVATIONS ARE THE RESULT OF THIS REAL EFFECT, PLUS CHANCE VARIATION.

IN THE COURT CASE, H0 SAYS THE JURY WAS RANDOMLY CHOSEN FROM THE
WHOLE POPULATION: AFRICAN AMERICANS HAVE PROBABILITY p = .50 OF BEING
CHOSEN.

Ha SAYS THAT AFRICAN AMERICANS ARE LESS LIKELY THAN THEIR PROPORTION
IN THE POPULATION TO BE SELECTED FOR A JURY PANEL: p < .50.


Step 2. THE TEST STATISTIC.

IDENTIFY A STATISTIC THAT WILL ASSESS THE EVIDENCE AGAINST THE NULL
HYPOTHESIS.

IN THE COURT CASE, THE TEST STATISTIC IS THE BINOMIAL RANDOM
VARIABLE X WITH p = .50 AND n = 80.


Step 3. P-VALUE.

A PROBABILITY STATEMENT WHICH ANSWERS THE QUESTION: IF THE NULL
HYPOTHESIS WERE TRUE, THEN WHAT IS THE PROBABILITY OF OBSERVING A TEST
STATISTIC AT LEAST AS EXTREME AS THE ONE WE OBSERVED?

(THE SMALLER THE P-VALUE, THE STRONGER THE EVIDENCE AGAINST H0.)


Step 4. COMPARE THE P-VALUE TO A FIXED SIGNIFICANCE LEVEL, α.

α ACTS AS A CUT-OFF POINT BELOW WHICH WE AGREE THAT AN EFFECT IS
STATISTICALLY SIGNIFICANT. THAT IS, IF

P-VALUE < α

THEN WE RULE OUT THE NULL HYPOTHESIS H0 AND AGREE THAT
SOMETHING ELSE IS GOING ON.
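
The four steps can be traced for the jury example with a short Python sketch. It assumes an 80-person panel with 4 African Americans and p = .5 under H0; the function name and the α = .05 cut-off are illustrative choices, not part of the original testimony.

```python
from math import comb

def binomial_left_tail(k, n, p):
    """Pr(X <= k) for X ~ Binomial(n, p): the 'at least as extreme' tail for Ha: p < p0."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Step 1: H0: p = .5, Ha: p < .5.  Step 2: X = number selected, n = 80.
# Step 3: P-value = Pr(X <= 4) if H0 were true.
p_value = binomial_left_tail(4, 80, 0.5)

# Step 4: compare with a fixed significance level (alpha = .05 is an assumed cut-off).
alpha = 0.05
print(p_value)            # on the order of 1e-18
print(p_value < alpha)    # True: rule out H0
```

Because the panel is small enough, the exact binomial tail is summed directly instead of using a normal approximation.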




IN SCIENTIFIC WORK, A FIXED α-LEVEL OF .05 OR .01 IS OFTEN USED. THESE
FIXED LEVELS ARE A HOLDOVER ARTIFACT FROM THE PRE-COMPUTER ERA,
WHEN WE HAD TO REFER TO TABLES, WHICH WERE PRINTED ONLY FOR
SELECTED CRITICAL VALUES. STILL, MANY SCIENTIFIC JOURNALS CONTINUE TO
PUBLISH RESULTS ONLY WHEN THE P-VALUE ≤ .05.


(EVEN THOUGH 1 TIME OUT OF 20, RESULTS WITH A SIGNIFICANCE
LEVEL OF p < .05 ARE FALSE!!)


LARGE SAMPLE SIGNIFICANCE TEST FOR PROPORTIONS

THE JURY EXAMPLE WAS A SPECIAL CASE OF A GENERAL PROBLEM: THE NULL
HYPOTHESIS HAD THE FORM \(p = p_0\), WHERE \(p_0\) WAS SOME PROBABILITY
(IN THIS CASE, .5). NOW LET'S LOOK AT SUCH PROBLEMS GENERALLY. LET'S
TEST THE HYPOTHESIS \(p = p_0\).


Step 1.

THE NULL HYPOTHESIS IS

\(H_0: p = p_0\)

THE ALTERNATE HYPOTHESIS DEPENDS ON THE DIRECTION OF THE EFFECT WE
ARE LOOKING FOR. IN SENATOR ASTUTE'S CASE,

\(H_a: p > p_0\)

BUT IN OTHER CASES, THE ALTERNATE HYPOTHESIS MIGHT WELL BE

\(H_a: p < p_0\)   OR   \(H_a: p \ne p_0\)

FOR EXAMPLE, IN THE JURY SELECTION EXAMPLE, THE ALTERNATIVE
HYPOTHESIS WAS

\(H_a: p < 0.5\)

AND AT OTHER TIMES, WE ARE INTERESTED IN KNOWING THAT p IS
DIFFERENT FROM SOME VALUE \(p_0\). FOR INSTANCE, IN TESTING FOR A FAIR
COIN, WE HAVE AN ALTERNATE HYPOTHESIS OF

\(H_a: p \ne 0.5\)

BUT HAVE NO A PRIORI OPINION ABOUT WHETHER HEADS OR TAILS WILL COME
UP MORE OFTEN.


Step 2. THE TEST STATISTIC IS

\[ Z_{obs} = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \]

WHICH MEASURES HOW FAR \(\hat{p}\) DEVIATES FROM \(p_0\). UNDER THE NULL
HYPOTHESIS, \(Z_{obs}\) HAS (APPROXIMATELY) THE STANDARD NORMAL
DISTRIBUTION.

Step 3. THE P-VALUE DEPENDS ON WHICH ALTERNATE HYPOTHESIS IS
RELEVANT:

a) "RIGHT-HANDED" \(H_a: p > p_0\)
   USES P-VALUE \(\Pr(Z > Z_{obs})\)

b) "LEFT-HANDED" \(H_a: p < p_0\)
   USES P-VALUE \(\Pr(Z < Z_{obs})\)

c) "TWO-SIDED" \(H_a: p \ne p_0\)
   USES P-VALUE \(\Pr(|Z| > |Z_{obs}|)\)
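
As a rough sketch of Steps 2 and 3, the Python below computes the test statistic and the P-value for each of the three alternatives. The function names and the coin-flip numbers (550 heads in 1000 flips) are hypothetical; the normal CDF is built from the standard library's `math.erf`.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def proportion_z_test(successes, n, p0, alternative):
    """Large-sample test of H0: p = p0; alternative is 'right', 'left' or 'two-sided'."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
    z_obs = (p_hat - p0) / se0
    if alternative == "right":
        p_value = 1 - normal_cdf(z_obs)
    elif alternative == "left":
        p_value = normal_cdf(z_obs)
    elif alternative == "two-sided":
        p_value = 2 * (1 - normal_cdf(abs(z_obs)))
    else:
        raise ValueError("alternative must be 'right', 'left' or 'two-sided'")
    return z_obs, p_value

# Hypothetical fair-coin check: 550 heads in 1000 flips, Ha: p != 0.5.
z_obs, p_value = proportion_z_test(550, 1000, 0.5, "two-sided")
print(round(z_obs, 2), round(p_value, 4))
```

The normal approximation is only reasonable for large n; for small samples an exact binomial tail is the safer choice.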


IN THE CASE OF SENATOR ASTUTE:


BUNCH OF DROPPED-OUT SCIENCE MAJORS, THE GROCERS THINK: SOUNDS
ABOUT RIGHT.


THEY PULL OUT A SIMPLE RANDOM SAMPLE OF 49 BOXES, WEIGH EACH ONE, AND
DETERMINE THE SAMPLE'S SUMMARY STATISTICS:

\(\bar{x} = 15.90\) oz.

\(s = .35\) oz.

A LITTLE LIGHT, BUT SIGNIFICANTLY SO?




THIS IS WHAT IS CALLED A TYPE I ERROR: AN ALARM WITHOUT A FIRE.
CONVERSELY, A TYPE II ERROR IS A FIRE WITHOUT AN ALARM. EVERY COOK
KNOWS HOW TO AVOID A TYPE I ERROR: JUST REMOVE THE BATTERIES.
UNFORTUNATELY, THIS INCREASES THE INCIDENCE OF TYPE II ERRORS!


WE CAN SUMMARIZE THIS IN A TWO-BY-TWO DECISION TABLE.


              NO FIRE      FIRE

NO ALARM      NO ERROR     TYPE II

ALARM         TYPE I       NO ERROR


NOW THINK OF THE NULL HYPOTHESIS AS THE CONDITION OF NO FIRE, WHILE
THE ALTERNATE HYPOTHESIS IS THAT A FIRE IS BURNING. THE ALARM
CORRESPONDS TO REJECTION OF THE NULL HYPOTHESIS.


                    TRUE STATE

                 H0           Ha

ACCEPT H0        NO ERROR     TYPE II

REJECT H0        TYPE I       NO ERROR


ALL THE SIGNIFICANCE TESTS WE DID EARLIER IN THIS CHAPTER EMPHASIZED
THE PROBABILITY OF COMMITTING A TYPE I ERROR, I.E., THE PROBABILITY OF
OUR OBSERVATIONS OCCURRING IF H0 WAS TRUE. WE DEMANDED THAT

\[ \Pr(\text{REJECTING } H_0 \mid H_0) = \Pr(\text{TYPE I ERROR} \mid H_0) \le \alpha \]

1-α MEASURES OUR CONFIDENCE THAT ANY ALARM BELLS WE HEAR ARE
GENUINE. HIGH CONFIDENCE MEANS RARELY SETTING OFF FALSE ALARMS.


BUT SOMETIMES WHAT WE REALLY WANT TO KNOW IS THE CHANCE OF MAKING
A TYPE II ERROR! IN OTHER WORDS, HOW SENSITIVE IS OUR "ALARM SYSTEM"
WHEN THE ALTERNATE HYPOTHESIS IS TRUE?


IN THE PAST, FACTORIES DISCHARGING CHEMICALS INTO WATERWAYS WERE
REQUIRED TO SHOW THAT THE DISCHARGE HAD NO EFFECT ON THE DOWNSTREAM
WILDLIFE. THAT'S H0. THE POLLUTER COULD CONTINUE AS LONG AS
THE NULL HYPOTHESIS WAS NOT REJECTED AT THE .05 SIGNIFICANCE LEVEL.


LET'S FORMALIZE THIS IDEA. TO DESCRIBE THE PROBABILITY OF A TYPE II
ERROR, WE BREAK OUT ANOTHER GREEK LETTER: BETA, OR β.

\[ \beta = \Pr(\text{ACCEPTING } H_0 \mid H_a) = \Pr(\text{TYPE II ERROR} \mid H_a) \]

THE POWER OF A TEST IS DEFINED AS 1-β. IT'S

\[ \Pr(\text{REJECTING } H_0 \mid H_a) \]
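
To make the definition of power concrete, here is a minimal Python sketch for one simple case: a right-tailed z test of a mean with known σ. The test setup, the function name, and the pollutant numbers are all assumptions made for illustration; they do not come from the text.

```python
import math
from statistics import NormalDist

def power_right_tailed(mu0, mu_true, sigma, n, alpha=0.05):
    """Power = Pr(reject H0 | true mean is mu_true) for H0: mu = mu0 vs Ha: mu > mu0."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # rejection cut-off on the z scale
    se = sigma / math.sqrt(n)
    shift = (mu_true - mu0) / se              # how far the truth sits from H0, in SEs
    return 1 - std_normal.cdf(z_alpha - shift)

# Hypothetical monitoring program: legal limit 10 units, sigma = 4, n = 16 samples.
for true_level in (10, 11, 12, 13, 14):
    power = power_right_tailed(10, true_level, 4, 16)
    beta = 1 - power                          # probability of a Type II error
    print(true_level, round(power, 3), round(beta, 3))
```

When the true level equals the null value, the "power" is just α, the false-alarm rate; it climbs toward 1 as the true level moves away from H0.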


YOU'LL BE HAPPY TO KNOW THE ENVIRONMENTAL REGULATORS HAVE MOVED IN
THE DIRECTION OF REQUIRING POLLUTION MONITORING PROGRAMS TO SHOW
THAT THEY HAVE A HIGH PROBABILITY OF DETECTING SERIOUS POLLUTION
EVENTS. THE REQUIRED POWER ANALYSIS OFTEN REVEALS HIDDEN FLAWS IN
THE MONITORING PROGRAM.




ONE WAY TO VISUALIZE THE EFFECT OF A TEST'S POWER IS BY GRAPHING THE
PROBABILITY OF REJECTING H0 AGAINST THE ACTUAL STATE OF THE SYSTEM. IN
THE CASE OF A SMOKE ALARM, THE PROBABILITY CLIMBS TOWARD 1 AS THE
SMOKE GETS THICKER.


FOR THE E.P.A. WATER QUALITY EXAMPLE, THE HORIZONTAL AXIS IS THE TRUE
CONCENTRATION OF POLLUTANT IN THE WATER.


[FIGURE: POWER CURVES (LEGEND INCLUDES GOLDEN MEAN AND DON'T ROCK THE
BOAT); HORIZONTAL AXIS: POLLUTANT CONCENTRATION]

HERE ARE THE POWER CURVES FOR THREE MONITORING PROGRAMS: THE SAVE
EVERY LAST GUPPY (COSTS $5 MILLION), THE GOLDEN MEAN (COSTS
$500,000), AND DON'T ROCK THE BOAT (ALSO COSTS $500,000, BUT THEY PUT
ON A GOOD SHOW!). THE HIGHER THE TEST'S POWER, THE STEEPER THE CURVE.

♦ Chapter 9 ♦

COMPARING
TWO POPULATIONS


IN WHICH WE LEARN SOME NEW RECIPES USING OLD INGREDIENTS.




BUT WHAT MAKES STATISTICS ALMOST AS CHALLENGING AS COOKING IS THE
VARIETY. LIKE AN EXPERT COOK, THE STATISTICIAN CAN "TASTE" THE
INGREDIENTS IN A PROBLEM AND THEN FIND THE MOST EFFECTIVE WAY TO
COMBINE THEM INTO A STATISTICAL RECIPE.


(THE REASON COOKBOOKS AND STATISTICAL METHODS TEXTS ARE SO HEAVY IS
THAT THEY BOTH PROVIDE SOLUTIONS IN A GREAT VARIETY OF SITUATIONS!)




THE COMMON INGREDIENT IN THESE QUESTIONS IS THIS: THEY CAN BE
ANSWERED BY COMPARING TWO INDEPENDENT RANDOM SAMPLES, ONE FROM EACH
OF TWO POPULATIONS.


[PESTICIDE / NO PESTICIDE]


AND, AT THE END OF THE CHAPTER, WE'LL LOOK AT A DIFFERENT WAY TO
COMPARE TWO MEANS THAT DOESN'T INVOLVE TAKING TWO SIMPLE




Comparing SUCCESS RATES

(or failure rates) for two populations.

WE BEGIN WITH AN EXPERIMENT, PART OF A HARVARD STUDY, THAT SOUGHT TO
PROBE THE EFFECTIVENESS OF ASPIRIN IN REDUCING HEART ATTACKS. AS IN
MOST CLINICAL TRIALS, THE CHANCE THAT ANY ONE INDIVIDUAL GETS THE
DISEASE (IN THIS CASE, A HEART ATTACK) IS VERY SMALL IN ANY GIVEN YEAR.
BUT WE WANT ANSWERS QUICKLY! WHAT DO WE DO?


THE SIMPLE, BUT EXPENSIVE, SOLUTION IS TO TEST A LARGE NUMBER OF
INDIVIDUALS IN A SHORT TIME. IN THIS STUDY, 22,071 SUBJECTS (ALL
VOLUNTEER DOCTORS) WERE RANDOMLY ASSIGNED TO TWO GROUPS.




OVER A PERIOD AVERAGING NEARLY FIVE YEARS*, THE INVESTIGATORS
RECORDED THE RESPONSES: HEART ATTACK OR NO HEART ATTACK.
THE RESULT (IN THE NUMBERS THAT FOLLOW, WE HAVE COMBINED FATAL AND
NONFATAL HEART ATTACKS):


            ATTACK    NO ATTACK    n         ATTACK RATE

PLACEBO     239       10,795       11,034    \(\hat{p}_1 = 239/11{,}034 = .0217\)

ASPIRIN     139       10,898       11,037    \(\hat{p}_2 = 139/11{,}037 = .0126\)


THE OBSERVED DIFFERENCE IN SUCCESS RATE IS
\(\hat{p}_1 - \hat{p}_2 = .0091\). IT SOUNDS SMALL UNTIL YOU LOOK AT
THE RELATIVE RISK: \(\hat{p}_1/\hat{p}_2 = .0217/.0126 = 1.72\).


MEMBERS OF THE PLACEBO GROUP WERE 1.72 TIMES LIKELIER TO SUFFER A
HEART ATTACK THAN THOSE IN THE ASPIRIN GROUP.


*THE STUDY WAS STOPPED EARLY BECAUSE OF ITS POSITIVE OUTCOME. IT WOULD
HAVE BEEN UNWISE AND IMPRACTICAL TO DENY THE RESULTS TO THE GROUP
TAKING THE PLACEBO.




The Model: THE PLACEBO AND ASPIRIN GROUP OBSERVATIONS ARE INDEPENDENT
SAMPLES FROM TWO BINOMIAL POPULATIONS. FOR CONSISTENCY, WE REFER TO A
HEART ATTACK AS A SUCCESS (!).

PLACEBO                              ASPIRIN

POPULATION ONE                       POPULATION TWO

CHANCE OF SUCCESS = \(p_1\)          CHANCE OF SUCCESS = \(p_2\)

THE OBJECTIVE IS TO ESTIMATE THE TRUE DIFFERENCE, \(p_1 - p_2\).


FOR EACH POPULATION (ACTUALLY LARGE SAMPLES OF THE GENERAL
POPULATION), WE HAVE THE FAMILIAR RANDOM VARIABLES

\(X_1\) = NUMBER OF SUCCESSES IN POPULATION ONE
\(X_2\) = NUMBER OF SUCCESSES IN POPULATION TWO

\(\hat{P}_1 = X_1/n_1\) = PROPORTION OF SUCCESSES IN POPULATION ONE
\(\hat{P}_2 = X_2/n_2\) = PROPORTION OF SUCCESSES IN POPULATION TWO

AND AN ESTIMATOR OF DIFFERENCE IN RATE: \(\hat{P}_1 - \hat{P}_2\)




Sampling distribution for \(\hat{P}_1 - \hat{P}_2\)


FOR LARGE SAMPLES, \(\hat{P}_1 - \hat{P}_2\) IS APPROXIMATELY NORMALLY
DISTRIBUTED. MUCH AS IN THE CASE OF ONLY ONE SAMPLE, WE CAN MAKE THE
USUAL z-TRANSFORM TO GET A STANDARD NORMAL RANDOM VARIABLE
(APPROXIMATELY):

\[ Z = \frac{(\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)}{\sigma(\hat{P}_1 - \hat{P}_2)} \]

BUT HOW DO WE FIND THAT STANDARD DEVIATION IN THE DENOMINATOR?


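Confidence Intervals for \(p_1 - p_2\)


AS USUAL, THE CONFIDENCE INTERVALS FOR OUR ESTIMATE LOOK LIKE THIS:

\[ p_1 - p_2 = \hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\,SE(\hat{p}_1 - \hat{p}_2) \]

(THE LEFT SIDE IS THE TRUE DIFFERENCE OF POPULATION PROPORTIONS.)


THE VARIANCES OF \(\hat{P}_1\) AND \(\hat{P}_2\) ADD, SO THE STANDARD
ERROR BECOMES

\[ SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]


IN THE ASPIRIN STUDY, THE STANDARD ERROR IS
\(SE(\hat{p}_1 - \hat{p}_2) = .00175\).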
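
A short Python sketch of this interval, using the aspirin counts. The group sizes 11,034 and 11,037 are assumed here from the attack and no-attack counts and the 22,071 total; the function name and the 95% critical value 1.96 are illustrative choices.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z_crit=1.96):
    """Large-sample confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    # The variances of the two sample proportions add.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, (diff - z_crit * se, diff + z_crit * se)

# Placebo: 239 attacks out of 11,034; aspirin: 139 attacks out of 11,037.
diff, se, (low, high) = diff_proportion_ci(239, 11034, 139, 11037)
print(round(diff, 4), round(se, 5))     # 0.0091 0.00175
print(round(low, 4), round(high, 4))
```

The whole interval lies above zero, which is the confidence-interval way of saying the difference is unlikely to be chance alone.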


hypothesis testing

THE FORMAL HYPOTHESIS-TESTING QUESTION IS


H0, THE NULL HYPOTHESIS, IS THAT ASPIRIN HAD NO EFFECT: \(p_1 = p_2\)
Ha, THE ALTERNATIVE, IS THAT ASPIRIN DOES REDUCE THE HEART ATTACK
RATE: \(p_1 > p_2\)


NOW WE NEED A TEST STATISTIC WITH A NORMAL DISTRIBUTION WHEN H0 IS
TRUE...


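NOTE THAT UNDER H0, THE TWO PROPORTIONS ARE THE SAME,
\(p_1 = p_2 = p\), SO LET'S POOL ALL THE DATA TO GET THE PROPORTION OF
HEART ATTACKS IN BOTH SAMPLES TOGETHER:

\[ \hat{p} = \frac{x_1 + x_2}{n_1 + n_2} \]


WHEN THE NULL HYPOTHESIS IS TRUE, THE STANDARD ERROR DEPENDS ONLY ON
THIS POOLED ESTIMATE:

\[ SE_0(\hat{P}_1 - \hat{P}_2) = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

AND WE CAN WRITE A TEST STATISTIC

\[ Z_{obs} = \frac{\hat{P}_1 - \hat{P}_2}{SE_0(\hat{P}_1 - \hat{P}_2)} \]


(THE NUMERATOR WOULD ORDINARILY BE
\((\hat{P}_1 - \hat{P}_2) - (p_1 - p_2)\),
BUT H0 ASSUMES \(p_1 - p_2 = 0\).)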


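\[ \hat{p} = \frac{378}{22{,}071} = .0171 \]

\[ SE_0(\hat{P}_1 - \hat{P}_2) = \sqrt{(.0171)(.9829)\left(\frac{1}{11{,}034} + \frac{1}{11{,}037}\right)} = .00175 \]

SO

\[ Z_{obs} = \frac{.0091}{.00175} = 5.20 \]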


\(Z_{obs}\) IS MORE THAN FIVE STANDARD DEVIATIONS FROM ZERO, A STRONG
POSITIVE EFFECT. USING A TABLE OR A COMPUTER, WE FIND THE P-VALUE:


\[ \text{P-VALUE} = \Pr(Z \ge Z_{obs}) = \Pr(Z \ge 5.2) = .0000001 \]


(BY USING A TABLE, A COMPUTER, OR A COMPUTER ON A TABLE.)


IF THE NULL HYPOTHESIS WERE TRUE, THE PROBABILITY OF OBSERVING AN
EFFECT THIS LARGE IS ONE IN TEN MILLION: VERY STRONG EVIDENCE
AGAINST H0!!!
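
The pooled test can be reproduced with a few lines of Python. The group sizes 11,034 and 11,037 are assumed from the counts and the 22,071 total, and the function name is hypothetical; the upper-tail probability uses `math.erfc`, which stays accurate far out in the tail.

```python
import math

def pooled_two_proportion_z(x1, n1, x2, n2):
    """Right-tailed large-sample test of H0: p1 = p2 against Ha: p1 > p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                 # pooled estimate under H0
    se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z_obs = (p1 - p2) / se0
    p_value = 0.5 * math.erfc(z_obs / math.sqrt(2))  # Pr(Z > z_obs)
    return z_obs, p_value

z_obs, p_value = pooled_two_proportion_z(239, 11034, 139, 11037)
print(round(z_obs, 2), p_value)   # z is about 5.2, P-value about one in ten million
```

Carrying full precision gives a z of roughly 5.19; the text's 5.20 comes from dividing the rounded values .0091 and .00175.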


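The general recipe:


TO TEST THE NULL HYPOTHESIS

\(H_0: p_1 = p_2\)

COMPUTE THE TEST STATISTIC

\[ Z_{obs} = \frac{\hat{p}_1 - \hat{p}_2}{SE_0(\hat{p}_1 - \hat{p}_2)} \]

(WHERE \(SE_0\) IS COMPUTED USING THE POOLED PROBABILITY OBTAINED BY
COMBINING BOTH GROUPS).


THE RELEVANT P-VALUE DEPENDS ON THE ALTERNATE HYPOTHESIS:

a) TWO-SIDED \(H_a: p_1 \ne p_2\)
   P-VALUE = \(\Pr(|Z| > |Z_{obs}|)\)

b) RIGHT \(H_a: p_1 > p_2\)
   P-VALUE = \(\Pr(Z > Z_{obs})\)

c) LEFT \(H_a: p_1 < p_2\)
   P-VALUE = \(\Pr(Z < Z_{obs})\)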


1. SUBJECTS WERE RANDOMLY ASSIGNED TO TREATMENT GROUPS.

2. THE EXPERIMENT WAS BLIND: SUBJECTS DIDN'T KNOW IF THEY WERE TAKING
ASPIRIN OR PLACEBO.

3. THE SAMPLE SIZE WAS LARGE ENOUGH FOR THE NORMAL APPROXIMATION TO
WORK.


POINTS 1 AND 2 ARE ESSENTIAL PARTS OF MOST HUMAN CLINICAL TRIAL
DESIGNS, BUT POINT 3 IS NOT ESSENTIAL. GOOD SMALL SAMPLE TESTS DO
EXIST AND ARE AVAILABLE IN COMPUTER SOFTWARE PACKAGES. THESE
NONPARAMETRIC PROCEDURES DEPEND ON SIMPLE, BUT LENGTHY, PROBABILITY
CALCULATIONS SIMILAR TO THE GAMBLING COMPUTATIONS WE ENCOUNTERED IN
CHAPTER 4...


(UM... WE ALSO ASSUMED THAT DOCTORS ARE REPRESENTATIVE OF THE GENERAL
POPULATION...)


Comparing the

MEANS of two populations


SUPPOSE WE WANTED TO COMPARE THE AVERAGE SALARY OF MALE AND FEMALE
EMPLOYEES IN THE SAME JOB AT SOME COMPANY.


POPULATION ONE IS THE WOMEN, AND POPULATION TWO IS THE MEN.




POPULATION ONE HAS MEAN SALARY \(\mu_1\) AND STANDARD DEVIATION
\(\sigma_1\).


POPULATION TWO HAS MEAN SALARY \(\mu_2\) AND STANDARD DEVIATION
\(\sigma_2\).


A RANDOM SAMPLE OF SIZE \(n_1\) FROM GROUP 1 AND \(n_2\) FROM GROUP 2
GIVES SAMPLE MEANS OF \(\bar{x}_1\) AND \(\bar{x}_2\) AND STANDARD
DEVIATIONS \(s_1\) AND \(s_2\), RESPECTIVELY. THE ESTIMATOR OF
\(\mu_1 - \mu_2\) IS

\(\bar{x}_1 - \bar{x}_2\)


hypothesis testing: WE ASSESS THE NULL HYPOTHESIS THAT THE TWO
POPULATION MEANS ARE EQUAL:

\(H_0: \mu_1 = \mu_2\)

THE TEST STATISTIC IS

\[ Z_{obs} = \frac{\bar{x}_1 - \bar{x}_2}{SE(\bar{x}_1 - \bar{x}_2)} \]




AND THE P-VALUES WORK IN THE USUAL WAY.


and how about comparing

SMALL SAMPLE

MEANS?

REMEMBER CHAMELEON MOTORS? THEIR COMPETITOR, IGUANA AUTO, CLAIMS
THAT ITS STYROFOAM HOOD ORNAMENT GIVES BETTER FRONT END CRASH
PROTECTION, AND THEY'VE CRASHED SEVEN IGUANAS TO PROVE IT!

(C'MON, CHAMELEON! YOU AN' ME! NATURE RED IN TOOTH AND CLAW!)


THE t DISTRIBUTION CAN BE USED IF BOTH POPULATIONS ARE MOUND SHAPED
AND HAVE THE SAME STANDARD DEVIATION
\(\sigma = \sigma_1 = \sigma_2\). THE ONLY WRINKLE IS THAT WE HAVE TO
POOL THE SUM OF SQUARES ABOUT THE MEANS TO FORM A SINGLE ESTIMATE OF
\(\sigma\):

\[ s_{pool}^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]


THE STANDARD ERROR IS THE SAME AS FOR LARGE SAMPLES, EXCEPT THAT
\(s_{pool}\) REPLACES \(s_1\) AND \(s_2\).


THE CONFIDENCE INTERVAL IS

\[ \mu_1 - \mu_2 = \bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2}\,SE(\bar{x}_1 - \bar{x}_2) \]


WHERE \(t_{\alpha/2}\) IS A CRITICAL VALUE OF t
WITH \(n_1 + n_2 - 2\) DEGREES OF FREEDOM.


THE REPTILIAN CARMAKERS AGREE THAT THEIR STANDARD DEVIATIONS ARE
CLOSE AND THEIR REPAIR HISTOGRAMS ARE MOUND-SHAPED. THEY COMPUTE:

\[ s_{pool} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = 264 \]

\[ SE(\bar{x}_1 - \bar{x}_2) = 264\sqrt{\frac{1}{5} + \frac{1}{7}} = 154 \]

THE 95% CONFIDENCE INTERVAL IS

\[ \mu_1 - \mu_2 = 540 - 300 \pm t_{.025}(154) = 240 \pm (2.23)(154) = 240 \pm 340 \]

(OK... FORGET SAFETY... BUT YOU CAN'T ARGUE WITH BEAUTIFUL STYLING!)

SINCE THIS INCLUDES THE VALUE 0, IGUANA AUTOS HAS NOT SHOWN A
SIGNIFICANT IMPROVEMENT IN REPAIR COSTS.
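
The small-sample comparison can be sketched in Python as follows. The summary numbers (means 540 and 300, sample sizes 5 and 7, pooled standard deviation 264, critical value 2.23 for 10 degrees of freedom) are read from the worked example; the function names are hypothetical.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pool two sample standard deviations, assuming equal population sigmas."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def pooled_t_interval(mean1, n1, mean2, n2, s_pool, t_crit):
    """Confidence interval for mu1 - mu2 using a pooled standard deviation."""
    se = s_pool * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    margin = t_crit * se
    return diff - margin, diff + margin, se

low, high, se = pooled_t_interval(540, 5, 300, 7, 264, 2.23)
print(round(se, 1))               # 154.6
print(low < 0 < high)             # True: the interval includes 0
```

`pooled_sd` is included for completeness; here the pooled value 264 is supplied directly because the two individual standard deviations are not both legible in the example.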



PAIRED COMPARISONS

a better way to compare gasolines


THE TAXI OWNER FOLLOWED THE COOKBOOK EXACTLY: HIS SAMPLES WERE
RANDOM, AND HIS SAMPLE SIZE WAS LARGE ENOUGH. HE JUST FAILED TO THINK
WHEN NECESSARY!


ALTHOUGH GAS B APPEARS TO BE SLIGHTLY BETTER THAN GAS A, THE
CONFIDENCE INTERVAL WAS WIDE BECAUSE OF THE LARGE STANDARD
DEVIATIONS, I.E., THE MILEAGES VARIED WIDELY FROM ONE CAB TO THE
NEXT. WHY SUCH HIGH VARIABILITY? BECAUSE CABS AND CABBIES HAVE
DIFFERENT PERSONALITIES!


WE STILL RANDOMIZE THE TREATMENT BY FLIPPING A COIN TO DECIDE
WHETHER TO USE GAS A ON TUESDAY OR WEDNESDAY. WE CAN ALSO CUT THE
EXPERIMENT DOWN TO 10 CABS, SAVING THE OWNER A LOT OF TIME AND
MONEY!


MEAN                  29.20
STANDARD DEVIATION     4.27


NOTE THAT THE MEANS AND STANDARD DEVIATIONS OF GAS A AND GAS B ARE
ABOUT THE SAME. THAT'S TO BE EXPECTED, SINCE THEY HAVE THE SAME SOURCE
OF VARIABILITY AS IN THE UNPAIRED EXPERIMENT. BUT NOW THE DIFFERENCE
COLUMN HAS A VERY SMALL STANDARD DEVIATION. THE DIFFERENCE COLUMN,
BY COMPARING GAS PERFORMANCE WITHIN A SINGLE CAR, ELIMINATES
VARIABILITY BETWEEN TAXIS.


SO WE HAVE \(.16 \le \mu_d \le 1.04\) WITH 95% CONFIDENCE, GOOD
EVIDENCE THAT GAS B REALLY IS BETTER.


THE HYPOTHESIS-TESTING P-VALUE CAN BE FOUND USING A SOFTWARE
PACKAGE:

\[ \text{P-VALUE} = \Pr(|t| > |t_{obs}|) = \Pr\left(|t| > \frac{|\bar{d}|}{SE(\bar{d})}\right) = \Pr(|t| > 3.15) = .012 < .05 \]


AGAIN, GAS B PASSES THE TEST.
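
The paired analysis reduces to a one-sample t computation on the differences. The Python sketch below uses made-up mileages for 10 cabs (the original data table is not reproduced here), a hypothetical function name, and the tabled critical value 2.262 for 9 degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t(first, second):
    """Return (mean difference, its standard error, t statistic) for paired data."""
    d = [b - a for a, b in zip(first, second)]   # within-cab differences
    d_bar = mean(d)
    se = stdev(d) / math.sqrt(len(d))
    return d_bar, se, d_bar / se

# Hypothetical miles per gallon for the same 10 cabs on gas A and gas B.
gas_a = [20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 21.0, 23.0, 25.0, 27.0]
gas_b = [20.5, 23.0, 24.0, 27.0, 28.5, 31.5, 21.0, 23.5, 26.0, 27.0]

d_bar, se, t_obs = paired_t(gas_a, gas_b)
t_crit = 2.262                                   # t_.025 with 9 degrees of freedom
low, high = d_bar - t_crit * se, d_bar + t_crit * se
print(round(d_bar, 2), round(t_obs, 2), round(low, 2), round(high, 2))
```

The cab-to-cab spread in `gas_a` and `gas_b` is large, but it cancels in the differences, which is why the paired interval is so much narrower.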


HERE ARE PLOTS OF THE GAS MILEAGE DATA. THE FIRST ONE SHOWS THE
MILEAGES UNPAIRED.

[FIGURE: GAS MILEAGES, UNPAIRED]

AND HERE'S THE SAME DATA PAIRED BY TAXICAB.

[FIGURE: GAS MILEAGES, PAIRED BY TAXICAB]


CEREALS IN RANDOM ORDER). THE PAIRED COMPARISON REMOVES THE
NATURAL BIAS OF THE TASTER FOR OR AGAINST CEREAL IN GENERAL.




IN THIS CHAPTER, WE APPLIED THE BASIC IDEAS ABOUT CONFIDENCE
INTERVALS AND HYPOTHESIS TESTING TO THE COMPARISON OF TWO
POPULATIONS. THERE ARE INNUMERABLE FURTHER POSSIBILITIES. WE COULD
HAVE GONE ON TO DESCRIBE COMPARISONS OF

• THE STANDARD DEVIATIONS OF TWO POPULATIONS WHEN SAMPLE SIZE IS
  SMALL,

• THE MEANS OF MORE THAN TWO POPULATIONS WHEN SAMPLE SIZE IS LARGE,

• THE MEANS OF MORE THAN TWO POPULATIONS WHEN SAMPLE SIZE IS SMALL,


(THAT'S WHY STATISTICS BOOKS ARE THICK!)


IN PRACTICE, STATISTICIANS DETERMINE THE GENERAL NATURE OF THE
PROBLEM, AND THEN CONSULT THE RIGHT REFERENCE BOOK.


THE ONLY THING REALLY NEW IN THE CHAPTER WAS THE IDEA OF THE PAIRED
COMPARISON TEST. IN THE NEXT CHAPTER, WE'LL LOOK AT SOME OTHER
KINDS OF EXPERIMENTAL DESIGNS.




♦ Chapter 10 ♦


IN THIS CHAPTER, WE INTRODUCE THE BASIC IDEAS OF EXPERIMENTAL
DESIGN, WHILE LEAVING THE DETAILED NUMERICAL ANALYSIS TO YOUR
HANDY STATISTICAL SOFTWARE PACK.


(W>  fDRMX&tfi 
TwS  CHAng?... 


THE ELEMENTS OF A DESIGN ARE THE EXPERIMENTAL UNITS AND THE
TREATMENTS THAT ARE TO BE ASSIGNED TO THE UNITS. THE OBJECTIVE OF
ANY DESIGN IS TO COMPARE THE TREATMENTS.


WHEAT VARIETIES, PESTICIDES, FERTILIZERS, ETC.




TODAY, EXPERIMENTAL DESIGN IDEAS ARE USED EXTENSIVELY IN INDUSTRIAL
PROCESS OPTIMIZATION, MEDICINE, AND SOCIAL SCIENCE. ANY EXPERIMENTAL
DESIGN USES THREE BASIC PRINCIPLES, WHICH ARE CLEARLY ILLUSTRATED IN
OUR CAB EXAMPLE.


Replication: THE SAME TREATMENTS ARE ASSIGNED TO DIFFERENT
EXPERIMENTAL UNITS. WITHOUT REPLICATION, IT'S IMPOSSIBLE TO ASSESS
NATURAL VARIABILITY AND MEASUREMENT ERROR.


SO FAR, WE HAVE ASSUMED THAT EVERY DAY OF THE WEEK IS THE SAME. BUT
WE CAN CONTROL FOR THIS, TOO, IN THE FOLLOWING WAY: USE ONLY FOUR
CABS, AND ASSIGN THE TREATMENT ACCORDING TO THE TABLE AT RIGHT:


A FOUR-BY-FOUR TABLE WITH FOUR DIFFERENT ELEMENTS, EACH APPEARING
ONCE IN EVERY COLUMN AND ROW, IS CALLED A

Latin square.

IN THIS EXPERIMENT, THE FOUR DAYS AND FOUR CABS GET ALL FOUR
TREATMENTS EXACTLY ONCE.


THE RANDOMIZATION STEP PICKS A SINGLE LATIN SQUARE DESIGN AT RANDOM
FROM A LIST OF ALL POSSIBLE FOUR-WAY LATIN SQUARES.
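
The randomization step can be illustrated with a brute-force Python sketch: enumerate every four-by-four Latin square, then pick one at random. The treatment labels A to D and the function name are placeholders, and the row/column reading (rows as cabs, columns as days) is an assumption about the table layout.

```python
import random
from itertools import permutations

def all_latin_squares(symbols):
    """List every Latin square on the given symbols, built row by row."""
    rows = list(permutations(symbols))
    n = len(symbols)
    squares = []

    def extend(partial):
        if len(partial) == n:
            squares.append(tuple(partial))
            return
        for row in rows:
            # A new row is allowed only if it repeats no symbol in any column.
            if all(row[c] != prev[c] for prev in partial for c in range(n)):
                extend(partial + [row])

    extend([])
    return squares

squares = all_latin_squares("ABCD")
print(len(squares))                 # 576 four-by-four Latin squares

design = random.choice(squares)     # the randomization step
for cab, row in enumerate(design, start=1):
    print("cab", cab, " ".join(row))
```

Exhaustive enumeration is fine at this size; for larger squares a smarter sampling method would be needed.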


EXPERIMENTAL DESIGNS ARE ANALYZED BY ALLOCATING TOTAL VARIABILITY
AMONG DIFFERENT SOURCES. IN THE CAB EXAMPLE, THE SOURCES OF
VARIABILITY ARE THE CAB, THE TIRE MAKE, GAS TYPE, DAY, AND RANDOM
ERROR. ANALYSIS OF VARIANCE, ANOVA FOR SHORT, PARTITIONS THE TOTAL
VARIATION, ALLOCATING PORTIONS TO EACH SOURCE.


IN THE NEXT CHAPTER, WE EXPLAIN IN DETAIL ONE MODEL FOR ANALYZING
COMPLEX DESIGNS: THE LINEAR REGRESSION MODEL. IN LINEAR
REGRESSION, YOU'LL BE ABLE TO SEE ANOVA UP CLOSE AND NUMERICAL.




♦ Chapter 11 ♦

REGRESSION


SO FAR, WE'VE DONE STATISTICS ON ONE VARIABLE AT A TIME, WHETHER IT
CAME FROM A POPULATION OF PILL TAKERS, PICKLES, OR CRASHED CARS. IN
THIS CHAPTER, WE'LL SEE HOW TO RELATE TWO VARIABLES. GIVEN THE
WEIGHTS OF THE 92 STUDENTS IN CHAPTER 2, WE ASK HOW THEY ARE RELATED
TO THE STUDENTS' HEIGHTS.


THIS IS AN EXAMPLE OF A BROAD CLASS OF IMPORTANT QUESTIONS: DOES
BLOOD PRESSURE LEVEL PREDICT LIFE EXPECTANCY? DO SAT SCORES
PREDICT COLLEGE PERFORMANCE? DOES READING STATISTICS BOOKS MAKE
YOU A BETTER PERSON?


FOR THIS CHAPTER, LET'S LABEL THE WEIGHT DATA AS y AND THE HEIGHT
DATA AS x. THUS \((x_i, y_i)\) IS THE HEIGHT AND WEIGHT OF STUDENT i.
WE DISPLAY THE POINTS \((x_i, y_i)\) IN A 2-DIMENSIONAL DOT PLOT
CALLED A SCATTERPLOT.


(SOME OF THE DOTS ARE BIGGER, BECAUSE THEY REPRESENT TWO OR THREE
STUDENTS WITH THE SAME HEIGHT AND WEIGHT.)


CAN WE PREDICT A STUDENT'S WEIGHT y FROM HIS OR HER HEIGHT x?

Regression analysis


THE IDEA IS TO MINIMIZE THE TOTAL SPREAD OF THE y VALUES FROM THE
LINE. JUST AS WHEN WE DEFINED THE VARIANCE, WE LOOK AT ALL THE
SQUARED y DISTANCES FROM THE LINE, AND ADD THEM UP TO GET THE SUM OF
SQUARED ERRORS

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

IT'S AN AGGREGATE MEASURE OF HOW MUCH THE LINE'S "PREDICTED \(y_i\),"
OR \(\hat{y}_i\), DIFFER FROM THE ACTUAL DATA VALUES \(y_i\).

The regression or
least squares line

IS THE LINE WITH THE SMALLEST SSE.


HISTORICAL NOTE: WHY DO WE CALL THIS PROCEDURE REGRESSION
ANALYSIS? AROUND THE TURN OF THE CENTURY, GENETICIST FRANCIS
GALTON DISCOVERED A PHENOMENON CALLED REGRESSION TOWARD
THE MEAN. SEEKING LAWS OF INHERITANCE, HE FOUND THAT SONS'
HEIGHTS TENDED TO REGRESS TOWARD THE MEAN HEIGHT OF THE
POPULATION, COMPARED TO THEIR FATHERS' HEIGHTS. TALL FATHERS
TENDED TO HAVE SOMEWHAT SHORTER SONS, AND VICE VERSA. GALTON
DEVELOPED REGRESSION ANALYSIS TO STUDY THIS EFFECT, WHICH HE
OPTIMISTICALLY REFERRED TO AS "REGRESSION TOWARD MEDIOCRITY."


NOT TO BEAT AROUND THE BUSH, WE GIVE WITHOUT PROOF THE REGRESSION
LINE'S FORMULA; IT'S MESSY BUT COMPUTABLE.


(HERE \(\bar{x}\) AND \(\bar{y}\) ARE THE MEANS OF \(\{x_i\}\) AND
\(\{y_i\}\), RESPECTIVELY.)

BECAUSE SOME OF THESE EXPRESSIONS WILL SHOW UP AGAIN, WE ABBREVIATE
THEM:


\(SS_{xx} = \sum (x_i - \bar{x})^2\)
\(SS_{yy} = \sum (y_i - \bar{y})^2\)

THESE ARE SUMS OF SQUARES AROUND THE MEAN. THEY MEASURE THE SPREAD OF
\(x_i\) AND \(y_i\).


\(SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})\)

THE CROSS PRODUCT DETERMINES THE COEFFICIENT b.


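FOR THE RIGGED DATA, HERE'S THE WHOLE COMPUTATION:


\(x_i\)   \(y_i\)   \((x_i - \bar{x})\)   \((y_i - \bar{y})\)   \((x_i - \bar{x})^2\)


WHICH GIVES VALUES OF a AND b:


\(b = SS_{xy}/SS_{xx} = 5\)     \(a = \bar{y} - b\bar{x} = 140 - 5(68) = -200\)


SO \(\hat{y} = -200 + 5x\)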
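
As a rough illustration of the computation, the Python below fits a least squares line from the sums of squares. The five data points are made up for the sketch (they are not the rigged height and weight data), and the function name is hypothetical.

```python
from statistics import mean

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x with the smallest SSE."""
    x_bar, y_bar = mean(xs), mean(ys)
    ss_xx = sum((x - x_bar) ** 2 for x in xs)                        # spread of x
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # cross product
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar      # forces the line through (x-bar, y-bar)
    return a, b

# Hypothetical data set.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(round(a, 2), round(b, 2))   # 2.2 0.6
```

The slope comes from the cross product divided by the x sum of squares, and the intercept is whatever makes the line pass through the point of means.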


SLOPPILY THE DATA POINTS ARE SPREAD OUT, I.E., HOW BIG SSE IS, RELATIVE
TO THE TOTAL SPREAD OF THE DATA. SOME EXAMPLES:


LET'S QUANTIFY THIS BY APPORTIONING THE VARIABILITY IN y. REFER TO
THE PICTURE AT RIGHT FOR GUIDANCE. WE LET

\(\hat{y}_i = a + b x_i\)

THUS, \(\hat{y}_i\) ARE THE PREDICTED WEIGHTS DETERMINED BY THE
REGRESSION LINE.


ANOVA table

SOURCE OF VARIABILITY    SUM OF SQUARES                                VALUE FOR RIGGED DATA

REGRESSION               \(SSR = \sum (\hat{y}_i - \bar{y})^2\)        6000

ERROR                    \(SSE = \sum (y_i - \hat{y}_i)^2\)            4424

TOTAL                    \(SS_{yy} = \sum (y_i - \bar{y})^2\)          10,424


(BY THE WAY, IT IS NOT OBVIOUS THAT \(SS_{yy} = SSR + SSE\), BUT IT'S
TRUE!) ANYWAY, HERE IS HOW THE REGRESSION AND ERROR SUMS OF SQUARES
ARE CALCULATED FOR THE RIGGED DATA SET, WITH \(\hat{y} = -200 + 5x\):

                              REGRESSION                     ERROR

\(x_i\)   \(y_i\)   \(\hat{y}_i\)   \((\hat{y}_i - \bar{y})^2\)   \((y_i - \hat{y}_i)^2\)
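
The claim that the total sum of squares splits into a regression part and an error part can be checked numerically. This Python sketch uses a small made-up data set rather than the rigged data, and the function name is hypothetical.

```python
from statistics import mean

def anova_sums_of_squares(xs, ys):
    """Return (SSR, SSE, SSyy) for the least squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    y_hat = [a + b * x for x in xs]                        # predicted values
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the line
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # left over as error
    ss_yy = sum((y - y_bar) ** 2 for y in ys)              # total spread of y
    return ssr, sse, ss_yy

# Hypothetical data set.
ssr, sse, ss_yy = anova_sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(ssr, 2), round(sse, 2), round(ss_yy, 2))   # 3.6 2.4 6.0
```

The identity holds only for the least squares line; for any other line the cross term does not vanish.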




NOW LET'S BE HONEST: NOBODY (WELL, ALMOST NOBODY) DOES THESE
CALCULATIONS BY HAND ANYMORE. WITH A COMPUTER, ALL THIS WORK CAN BE
DONE IN ONE LINE OF CODE...


(IN FACT, THIS ENTIRE BOOK CAN BE COMPRESSED INTO THE HEAD OF A



USING THE MINITAB STATISTICAL SOFTWARE SYSTEM, DEVELOPED AT PENN
STATE, THE SINGLE COMMAND LOOKS LIKE THIS:




STATISTICAL 


FOR THE HEIGHT VS. WEIGHT MODEL, y IS WEIGHT, x IS HEIGHT, α AND β
ARE UNKNOWN, AND YOU CAN THINK OF ε AS THE RANDOM COMPONENT OF THE
WEIGHTS y FOR EACH VALUE OF HEIGHT x.


THE DISTRIBUTION OF ε IS IN FACT DIFFERENT FOR DIFFERENT VALUES OF x:
5-FOOTERS VARY LESS IN THEIR WEIGHT THAN 6-FOOTERS. NEVERTHELESS, WE
NOW MAKE A SIMPLIFYING ASSUMPTION: LET'S SUPPOSE THAT FOR ALL VALUES
OF x, THE ε'S ARE INDEPENDENT, NORMAL, AND HAVE THE SAME STANDARD
DEVIATION \(\sigma = \sigma(\varepsilon)\) AND MEAN \(\mu = 0\).




NOW, GIVEN THE MODEL \(Y = \alpha + \beta x + \varepsilon\), WE WANT TO
DO AS WE'VE DONE REPEATEDLY IN THE LAST FEW CHAPTERS: TAKE A SAMPLE
AND USE IT TO ESTIMATE α AND β.


ONE CAN SHOW THAT THE a AND b WE GOT BY THE LEAST-SQUARES METHOD ARE
BLUE: THE Best Linear Unbiased Estimators OF α AND β

(WHATEVER THAT MEANS!!)


[FIGURE LEGEND: MODEL LINE, REGRESSION LINE, DATA POINT]


(UNCONDITIONALLY GUARANTEED!)


AS USUAL, DIFFERENT SAMPLES YIELD DIFFERENT COLLECTIONS OF DATA,
WHICH GENERATE DIFFERENT REGRESSION LINES. THESE LINES ARE
DISTRIBUTED AROUND THE LINE \(y = \alpha + \beta x + \varepsilon\). OUR
QUESTION BECOMES: HOW ARE a AND b DISTRIBUTED AROUND α AND β,
RESPECTIVELY, AND HOW DO WE CONSTRUCT CONFIDENCE INTERVALS AND TEST
HYPOTHESES?


FOR EACH DATA POINT \((x_i, y_i)\), WE HAVE

\(y_i = a + b x_i + e_i\)

WHERE \(e_i = y_i - \hat{y}_i\), THE y-DISTANCE OF \(y_i\) FROM THE
REGRESSION LINE. THE \(e_i\) ARE SAMPLE VALUES OF ε, AND THEY GIVE US
AN ESTIMATOR s FOR \(\sigma(\varepsilon)\):

\[ s = \sqrt{\frac{\sum_i e_i^2}{n-2}} = \sqrt{\frac{SSE}{n-2}} \]

(WHY n-2 IN THE DENOMINATOR? BECAUSE WE HAVE USED UP TWO DEGREES OF
FREEDOM TO COMPUTE a AND b, LEAVING n-2 INDEPENDENT PIECES OF
INFORMATION TO ESTIMATE σ.)

TO REPEAT, s IS AN ESTIMATOR OF HOW WIDELY THE DATA POINTS WILL BE
SCATTERED AROUND THE LINE.


confidence intervals

THE 95% CONFIDENCE INTERVALS FOR α AND β HAVE THAT OLD, FAMILIAR
FORM:

\(\beta = b \pm t_{.025}\,SE(b)\)

\(\alpha = a \pm t_{.025}\,SE(a)\)

WHERE WE USE THE t DISTRIBUTION WITH n-2 DEGREES OF FREEDOM
(FOR THE SAME REASON AS ABOVE).


THE STANDARD ERRORS, HOWEVER, LOOK RATHER UNFAMILIAR. THEY ARE
(WITHOUT DERIVATION)


\[ SE(b) = \frac{s}{\sqrt{SS_{xx}}} \]

\[ SE(a) = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{SS_{xx}}} \]


ryes.,  looks  ure 

THE  CftHlpt  LACCP 
ALMONP  ToRTE 
FROM  THE  AtycreftY 
,  of  me  pent' 9 
yP&JOMifJATOR 


WHAT HAPPENED TO OUR PREVIOUS \(\sqrt{n}\)? IT WAS REPLACED BY
\(SS_{xx}\). LIKE n, \(SS_{xx}\) INCREASES AS WE ADD MORE DATA POINTS,
BUT IT ALSO REFLECTS THE TOTAL SPREAD OF THE x DATA. FOR EXAMPLE, IF
ALL STUDENTS SAMPLED HAD THE SAME HEIGHT, WE WOULD BE UNJUSTIFIED IN
DRAWING ANY CONCLUSION ABOUT THE DEPENDENCE OF WEIGHT ON HEIGHT. IN
THAT CASE, \(SS_{xx} = 0\), GIVING \(SE(b) = \infty\) AND INFINITELY
WIDE CONFIDENCE INTERVALS.




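MORE QUESTIONS


WE PREDICT \(Y_{new}\) WITHOUT MEASURING IT?


THE 95% PREDICTION INTERVAL FOR A NEW INDIVIDUAL \(Y_{new}\) WITH
OBSERVED \(x_{new}\) IS

\[ Y_{new} = a + b\,x_{new} \pm t_{.025}\,SE(Y_{new}) \]

WHERE

\[ SE(Y_{new}) = s\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{SS_{xx}}} \]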

BOTH THESE STANDARD ERRORS CONTAIN A TERM THAT GROWS LARGER AS THE
x-VALUE, \(x_0\) OR \(x_{new}\), GETS FARTHER FROM THE MEAN VALUE
\(\bar{x}\). WHY DOES THE ERROR INCREASE FARTHER FROM \(\bar{x}\)?
BECAUSE, IF YOU WIGGLE THE REGRESSION LINE, IT MAKES MORE OF A
DIFFERENCE FARTHER FROM THE MEAN! (REMEMBER, THE LINE ALWAYS PASSES
THROUGH \((\bar{x}, \bar{y})\).)


LET'S WORK IT OUT FOR THE RIGGED DATA FOR THE MEAN WEIGHT WHEN
\(x_0 = 76\) INCHES.

WE HAVE a = -200 AND b = 5.

THEN

\[ y = -200 + 5(76) \pm (2.365)(25.14)\sqrt{\frac{1}{9} + \frac{(76-68)^2}{240}} \]
\[ = 180 \pm (2.365)(25.14)\sqrt{.3778} \]
\[ = 180 \pm 36.54 \text{ POUNDS} \]


THE ESTIMATED MEAN OF 6'4" STUDENTS IS 180 POUNDS, AND WE'RE 95%
CONFIDENT THAT WE'RE WITHIN 37 POUNDS OF THE TRUE MEAN.


FOR A NEW STUDENT WHO'S 6'4", WE USE OUR RIGGED SAMPLE OF NINE
POINTS TO PREDICT THAT

\[ Y_{new} = -200 + 5(76) \pm (2.365)(25.14)\sqrt{1 + \frac{1}{9} + \frac{(76-68)^2}{240}} \]
\[ = 180 \pm (2.365)(29.51) \]
\[ = 180 \pm 70 \text{ POUNDS} \]


WE TELL THE FOOTBALL COACH THAT WE'RE PRETTY SURE THE NEW GUY WEIGHS
SOMEWHERE BETWEEN 110 AND 250!!!
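
Both half-widths can be checked with a short Python sketch. The summary values are taken from the worked example (n = 9, mean height 68, s = 25.14, t critical value 2.365 for 7 degrees of freedom); the value 240 for the x sum of squares is an assumption inferred from the .3778 under the square root, and the function name is hypothetical.

```python
import math

def interval_half_widths(x0, n, x_bar, ss_xx, s, t_crit):
    """Half-widths of the interval for the mean response and for a new individual at x0."""
    leverage = 1 / n + (x0 - x_bar) ** 2 / ss_xx   # grows as x0 moves away from x_bar
    mean_margin = t_crit * s * math.sqrt(leverage)
    prediction_margin = t_crit * s * math.sqrt(1 + leverage)
    return mean_margin, prediction_margin

a, b = -200, 5
x0 = 76                                            # 6'4" in inches
estimate = a + b * x0
mean_margin, prediction_margin = interval_half_widths(x0, 9, 68, 240, 25.14, 2.365)
print(estimate, round(mean_margin, 1), round(prediction_margin, 1))   # 180 36.5 69.8
```

The extra 1 under the square root is the new individual's own scatter around the line, which is why the prediction interval is always wider than the interval for the mean.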




THE INTERVALS ARE PRETTY TERRIBLE! WHAT'S THE PROBLEM? THERE ARE
TWO PROBLEMS, ACTUALLY:


hypothesis testing


THE COMPLETE SKEPTIC MIGHT SUGGEST THAT THERE IS NO RELATIONSHIP
BETWEEN HEIGHT AND WEIGHT. THIS AMOUNTS TO SAYING THAT β = 0.


[FIGURE: x HAS NO EFFECT ON y]

WE TAKE THIS AS THE NULL HYPOTHESIS.

\(H_0: \beta = 0\)

IN  THAT  CASE.  THE  TEST  STATISTIC 
t  -  ^ 

sea; 

HAS  THE  t  PISTRIBUTION  WITH 
n-2  PEGREES  OF  FREEPOM 
AS  USUAL,  THE  SIGNIFICANCE  TEST 
PEPENPS  ON  THE  ALTERNATE 
HYPOTHESIS. 


t  >  ta  FOR  Ha  P>  O 
t<ta  FOR  Ma  :  P  <  O 
ltl>lt%l  FOR  Ma  ■  fi 


FOR THE RIGGED WEIGHT DATA, WE
STRONGLY SUSPECT THE ALTERNATE
HYPOTHESIS SHOULD BE

$H_a: \beta > 0$

WE TEST

$t_{OBS} = \frac{b}{SE(b)}$
$= 3.08$

FOR 7 DEGREES OF FREEDOM,
$t_{.05} = 1.895$. SINCE $t_{OBS} > t_{.05}$, WE
REJECT THE NULL HYPOTHESIS AT THE
$\alpha = .05$ SIGNIFICANCE LEVEL AND
CONCLUDE THAT THERE IS A
SIGNIFICANT, POSITIVE RELATIONSHIP
BETWEEN HEIGHT AND WEIGHT.
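
As a rough check of this test, the sketch below recomputes the statistic from summary values. It assumes the usual formula SE(b) = s / sqrt(SS_xx) and the values b = 5, s = 25.15 and SS_xx = 240; SS_xx in particular is an assumption, not a figure quoted on this page. The critical value 1.895 is the one-sided 5% point of t with 7 degrees of freedom.

```python
import math

def slope_t_statistic(b, s, ss_xx):
    """t statistic for H0: beta = 0, using SE(b) = s / sqrt(SS_xx)."""
    se_b = s / math.sqrt(ss_xx)
    return b / se_b

# Assumed summary values for the rigged height/weight sample.
b = 5.0         # fitted slope (pounds per inch)
s = 25.15       # residual standard deviation
ss_xx = 240.0   # ASSUMED sum of squares of the heights
t_crit = 1.895  # t_.05 with 7 degrees of freedom (one-sided test)

t_obs = slope_t_statistic(b, s, ss_xx)
reject_null = t_obs > t_crit  # right-handed alternative: beta > 0

print(f"t observed = {t_obs:.2f}, critical value = {t_crit}")
print("reject H0" if reject_null else "do not reject H0")
```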


Multiple linear regression


WE CAN USE THE SAME BASIC
IDEAS TO ANALYZE
RELATIONSHIPS BETWEEN A
DEPENDENT VARIABLE AND
SEVERAL INDEPENDENT
VARIABLES:


DON'T WORRY!
IT'S JUST AN AFFINE
n-1 DIMENSIONAL HYPERPLANE
IN n-SPACE!
NOTHING TO IT!


$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$


FOR EXAMPLE, WEIGHT IS
DETERMINED BY A NUMBER
OF FACTORS OTHER THAN
HEIGHT: AGE, SEX, DIET, BODY
TYPE, ETC.


MATRIX ALGEBRA AND A COMPUTER COMBINE TO MAKE SUCH PROBLEMS EASY
TO ANALYZE.
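
To make "matrix algebra and a computer" concrete, here is a minimal pure-Python sketch that fits a model with two independent variables by solving the normal equations. It is an illustration only: the data are made up (weights generated exactly from -200 + 5·height + 0.5·age), and the function names are hypothetical, not anything from the packages used for the book.

```python
def solve_linear_system(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_multiple_regression(rows, y):
    """Least squares coefficients [intercept, b1, b2, ...] via (X'X) b = X'y."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up data: weight depends exactly on height and age, so the fit
# should recover the planted coefficients.
heights = [60, 62, 64, 66, 68, 70, 72, 74, 76]
ages = [18, 25, 19, 30, 22, 21, 27, 20, 24]
weights = [-200 + 5 * h + 0.5 * g for h, g in zip(heights, ages)]

coeffs = fit_multiple_regression(list(zip(heights, ages)), weights)
print([round(c, 4) for c in coeffs])
```

Real data would add scatter around the plane, and a statistical package would also report standard errors for each coefficient.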


Non-linear regression


SOMETIMES DATA OBVIOUSLY
FIT A NON-LINEAR CURVE.
STATISTICIANS HAVE A BAG OF
TRICKS FOR USING LINEAR
REGRESSION TECHNIQUES FOR
NON-LINEAR PROBLEMS. THE
SIMPLEST OF THESE IS TO
WRITE y AS A POLYNOMIAL

$Y = \alpha + \beta_1 x + \beta_2 x^2 + \varepsilon$

AND TREAT $x$ AND $x^2$ AS
INDEPENDENT VARIABLES IN A
LINEAR MODEL.
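
The polynomial trick can be sketched as follows: build the columns 1, x and x², then solve the same least squares normal equations as for any linear model. This is a hedged illustration with made-up data (points lying exactly on 2 + 0.5x + 1.5x²); the helper names are hypothetical, and Cramer's rule is used only because the system is 3 by 3.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(xs, ys):
    """Fit y = alpha + beta1*x + beta2*x^2, treating x and x^2 as two predictors."""
    cols = [[1.0] * len(xs),
            [float(x) for x in xs],
            [float(x) ** 2 for x in xs]]
    # Normal equations (X'X) b = X'y for the three columns.
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * y for u, y in zip(ci, ys)) for ci in cols]
    d = det3(xtx)
    coeffs = []
    for k in range(3):  # Cramer's rule: replace column k by X'y
        mk = [row[:] for row in xtx]
        for r in range(3):
            mk[r][k] = xty[r]
        coeffs.append(det3(mk) / d)
    return coeffs

# Made-up points on an exact parabola.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [2 + 0.5 * x + 1.5 * x * x for x in xs]

alpha, beta1, beta2 = fit_quadratic(xs, ys)
print(round(alpha, 6), round(beta1, 6), round(beta2, 6))
```

The model is non-linear in x but still linear in the coefficients, which is why the linear machinery applies unchanged.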


Regression  diagnostics 

FITTING A COMPLEX MODEL TO DATA CAN SOMETIMES OBSCURE MANY
DIFFICULTIES. WE USE REGRESSION DIAGNOSTIC PROCEDURES TO UNCOVER ANY
LURKING NASTY SURPRISES.


/  PIA&N0SED  A  L, 
6*A?H  PEFOpe,  r 
Ur  PLUPpEsuccr^y 


THE SIMPLEST PROCEDURE IS TO PLOT THE RESIDUALS $e_i$ AGAINST THE
PREDICTOR $x_i$. REMEMBER, THE ERROR $\varepsilon$ IS ASSUMED TO BE INDEPENDENT
OF x.


A RANDOM SCATTERPLOT INDICATES
THAT THE MODEL ASSUMPTIONS
ARE PROBABLY OK.


ANY PATTERN INDICATES A
DEFINITE PROBLEM WITH THE
MODEL ASSUMPTIONS.


A TYPICAL LURKING
NASTY SURPRISE (WHICH
EXISTS IN THE
HEIGHT/WEIGHT DATA)

IS THAT ERRORS ARE
HETEROSCEDASTIC: I.E.,
THE SPREAD OF e
INCREASES AS y
INCREASES.
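
A plot is the natural diagnostic, but the same idea can be sketched numerically: fit the line, compute the residuals, and compare their typical size in the lower and upper halves of the x range. The data below are made up so that the scatter grows with x, and the half-versus-half comparison is an illustrative shortcut, not a formal test.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx
    return y_bar - b * x_bar, b

def residuals(xs, ys):
    """Residuals e_i = y_i - (a + b*x_i) from the least squares line."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def spread_by_half(xs, es):
    """Mean absolute residual for the lower and upper halves of the x values."""
    pairs = sorted(zip(xs, es))
    half = len(pairs) // 2
    low = [abs(e) for _, e in pairs[:half]]
    high = [abs(e) for _, e in pairs[half:]]
    return sum(low) / len(low), sum(high) / len(high)

# Made-up data whose scatter grows with x (heteroscedastic by construction).
xs = list(range(1, 11))
ys = [3 + 2 * x + (-1) ** x * 0.5 * x for x in xs]

es = residuals(xs, ys)
low_spread, high_spread = spread_by_half(xs, es)
print(f"spread for small x: {low_spread:.2f}, for large x: {high_spread:.2f}")
```

A clearly larger spread in one half is the numerical counterpart of a fan-shaped residual plot.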


IN THIS CHAPTER, WE
HAVE SUMMARIZED
THE BASIC IDEAS AND
TECHNIQUES OF
REGRESSION
ANALYSIS, THE STUDY
OF STATISTICAL
RELATIONSHIPS
BETWEEN VARIABLES.
THIS CONCLUDES OUR
DETAILED DISCUSSION
OF BASIC STATISTICAL
METHODS. IN OUR
FINAL CHAPTER, WE'LL
BRIEFLY REVIEW A
FEW REMAINING
TOPICS AND ISSUES.

IN MY PROFESSIONAL
OPINION, YOU'VE
HAD ENOUGH!

♦ Chapter 12 ♦

CONCLUSION


THE BASIC PRINCIPLES, TOOLS, AND
CALCULATIONS COVERED IN THIS BOOK CAN
BE EXTENDED TO SOLVE MORE COMPLEX
PROBLEMS. HERE'S A BIASED SAMPLE OF
MORE ADVANCED STATISTICAL METHODS.


Statistical  analysis  of 

MULTIVARIATE  DATA 

AN ASSORTMENT OF MULTIVARIATE MODELS HELP TO ANALYZE AND DISPLAY
n-DIMENSIONAL DATA. SOME MULTIVARIATE TECHNIQUES:

Cluster analysis

SEEKS TO DIVIDE THE
POPULATION INTO
HOMOGENEOUS SUBGROUPS.

FOR EXAMPLE, BY ANALYZING
CONGRESSIONAL VOTING
PATTERNS, WE FIND THAT
REPRESENTATIVES FROM THE
SOUTH AND WEST FORM TWO
DISTINCT CLUSTERS.


Discriminant  analysis 


IS THE REVERSE PROCESS. FOR EXAMPLE, A COLLEGE ADMISSIONS OFFICE MIGHT
LIKE TO FIND DATA GIVING ADVANCE WARNING WHETHER AN APPLICANT WILL GO
ON TO BE A SUCCESSFUL GRADUATE (DONATES HEAVILY TO THE ALUMNI FUND)
OR AN UNSUCCESSFUL ONE (GOES OUT TO DO GOOD IN THE WORLD AND IS
NEVER HEARD FROM AGAIN).

COULDN'T WE FIND
SOME IDEALISTIC ...


Factor  analysis 

SEEKS TO EXPLAIN HIGH-DIMENSIONAL
DATA WITH A
SMALLER NUMBER OF
VARIABLES. A PSYCHOLOGIST
MAY GIVE A TEST WITH 100
QUESTIONS, WHILE SECRETLY
ASSUMING THAT THE
ANSWERS DEPEND ON ONLY
A FEW FACTORS:
EXTROVERSION,
AUTHORITARIANISM, ALTRUISM,
ETC. THE TEST RESULTS
WOULD THEN BE SUMMARIZED
USING ONLY A FEW
COMPOSITE SCORES IN
THOSE DIMENSIONS.


/OH  A  SCALE  FROM  OME  TO 

[  lou'se  7  b  WtoJEfinS),  4.^ 

\  ALTRUISTIC,  fiM?  2  7  MI1HOPrrN?/Wl 
\TVATS  ybo,  w  A  NUT4UEU. ! 


WE'VE ALREADY SEEN HOW THE COMPUTER HELPS WITH ANALYSIS AND
ARITHMETIC. THERE ARE ALSO SOME STATISTICAL IDEAS THAT OWE THEIR VERY
EXISTENCE TO THE COMPUTER.

Image analysis

A COMPUTER IMAGE MIGHT CONSIST OF 1000 BY 1000 PIXELS, WITH EACH DATA
POINT REPRESENTED FROM A RANGE OF 16.7 MILLION COLORS AT ANY PIXEL.
STATISTICAL IMAGE ANALYSIS SEEKS TO EXTRACT MEANING FROM "INFORMATION"
LIKE THIS.


Resampling

SOMETIMES, STANDARD ERRORS AND CONFIDENCE LIMITS ARE IMPOSSIBLE TO
FIND. ENTER RESAMPLING, A TECHNIQUE THAT TREATS THE SAMPLE AS THOUGH
IT WERE THE POPULATION. THESE TECHNIQUES GO BY SUCH NAMES AS
RANDOMIZATION, JACKKNIFE, AND BOOTSTRAPPING.


Bootstrapped Correlations


NOTE THAT THE SPREAD OF THE BOOTSTRAP ESTIMATES IS RELATIVELY SMALL.
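
A bootstrap of a correlation coefficient can be sketched in a few lines: draw (x, y) pairs from the sample with replacement, recompute the correlation each time, and look at the spread of the results. The height/weight pairs below are hypothetical illustration data, and the seed, the 1000 resamples and the function names are arbitrary choices rather than the settings behind the figure.

```python
import math
import random

def correlation(xs, ys):
    """Sample correlation coefficient, or None if a variable does not vary."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if ss_xx == 0 or ss_yy == 0:
        return None
    return ss_xy / math.sqrt(ss_xx * ss_yy)

def bootstrap_correlations(xs, ys, n_boot, rng):
    """Resample pairs with replacement, treating the sample as the population."""
    pairs = list(zip(xs, ys))
    estimates = []
    while len(estimates) < n_boot:
        resample = rng.choices(pairs, k=len(pairs))
        r = correlation([p[0] for p in resample], [p[1] for p in resample])
        if r is not None:  # skip degenerate resamples
            estimates.append(r)
    return estimates

# Hypothetical height (inches) and weight (pounds) pairs for illustration.
heights = [60, 62, 64, 66, 68, 70, 72, 74, 76]
weights = [84, 95, 140, 155, 119, 175, 145, 197, 150]

rng = random.Random(1993)  # fixed seed so the run is repeatable
r_sample = correlation(heights, weights)
boot = bootstrap_correlations(heights, weights, 1000, rng)
boot_mean = sum(boot) / len(boot)
boot_sd = math.sqrt(sum((r - boot_mean) ** 2 for r in boot) / (len(boot) - 1))

print(f"sample correlation: {r_sample:.3f}")
print(f"bootstrap standard deviation: {boot_sd:.3f}")
```

The bootstrap standard deviation plays the role of a standard error for the correlation, obtained without any formula.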


DATA  QUALITY 


SEEMINGLY SMALL ERRORS IN
SAMPLING, MEASUREMENT, AND DATA
RECORDING CAN PLAY HAVOC WITH ANY
ANALYSIS. R. A. FISHER, GENETICIST
AND FOUNDER OF MODERN STATISTICS,
NOT ONLY DESIGNED AND ANALYZED
ANIMAL BREEDING EXPERIMENTS, HE
ALSO CLEANED THE CAGES AND
TENDED THE ANIMALS, BECAUSE HE
KNEW THAT THE LOSS OF AN ANIMAL
WOULD INFLUENCE HIS RESULTS.


BIBLIOGRAPHY 


SOCIETY. 


GASTWIRTH, JOSEPH L., STATISTICAL REASONING IN
LAW AND POLICY, VOL. 111, 1988, SAN DIEGO,
ACADEMIC PRESS. THE LEGAL NITTY GRITTY, INCLUDING
JURY SELECTION CASES LIKE THE ONE THAT BEGAN
CHAPTER 9.

STEERING COMMITTEE OF THE PHYSICIANS' HEALTH
STUDY RESEARCH GROUP, "FINAL REPORT ON THE
ASPIRIN COMPONENT OF THE ONGOING PHYSICIANS'
HEALTH STUDY," THE NEW ENGLAND JOURNAL OF
MEDICINE, VOL. 321, PP. 129-135.


IN CHAPTER 9, THE NONJUDICIAL COMMENT ON POKER FROM THE BENCH WAS FROM AN
ACTUAL CASE, WE ARE ASSURED IN A PERSONAL COMMUNICATION FROM DR. JOHN DE CANI,
UNIVERSITY OF PENNSYLVANIA.




STATISTICAL SOFTWARE

IN THIS BOOK WE USED THE MINITAB STATISTICAL SOFTWARE SYSTEM (MINITAB INC.,
STATE COLLEGE, PA). THE PENN STATE STUDENT HEIGHT AND WEIGHT DATA IS FROM THE
PULSE DATA SET ON THIS SYSTEM. COMPUTER GRAPHICS WERE GENERATED BY S-PLUS
(STATISTICAL SCIENCES INC., SEATTLE, WA), ON A 486 PC CLONE. S IS SOPHISTICATED SOFTWARE
DEVELOPED BY AT&T BELL LABS FOR ADVANCED ANALYSIS AND GRAPHICAL DISPLAYS.


RYAN, BARBARA, JOINER, BRIAN, AND RYAN, THOMAS,
MINITAB HANDBOOK (PWS-KENT, BOSTON, 1985) AND
THE STUDENT EDITION OF MINITAB (ADDISON
WESLEY) ARE FAST, INEXPENSIVE INTRODUCTIONS TO
STATISTICAL COMPUTING. MINITAB RUNS ON MAINFRAMES,
PC COMPATIBLES, AND MACINTOSH COMPUTERS.


THERE ARE MANY HIGH QUALITY SOFTWARE PACKAGES
AVAILABLE FOR THE PERSONAL COMPUTER, INCLUDING:

DATADESK (DATA DESCRIPTION, ITHACA, NY), FOR THE
MACINTOSH

SAS (SAS INSTITUTE INC., CARY, NC), SPSS (SPSS INC.,
CHICAGO, IL), AND BMDP (BMDP STATISTICAL SOFTWARE,
INC., LOS ANGELES, CA) WERE ORIGINALLY DESIGNED FOR
MAINFRAME SYSTEMS AND NOW HAVE MIGRATED TO THE PC,
COMPLETE WITH WINDOWS

STATGRAPHICS (STATISTICAL GRAPHICS CORP., PRINCETON,
NJ), FOR THE PC

STATVIEW (ABACUS CONCEPTS, OAKLAND, CA), FOR THE
MACINTOSH

SYSTAT (SYSTAT, INC., EVANSTON, IL) HAS SYSTEMS THAT
RUN IN ALL ENVIRONMENTS


THESE PACKAGES DIFFER IN IMPORTANT DETAILS. YOU NEED TO BE A SMART SHOPPER.
WE RECOMMEND CHOOSING A SYSTEM THAT YOUR COLLEAGUES HAVE ALREADY TESTED.
FEW OF US ARE CUT OUT TO BE STATISTICAL SOFTWARE PIONEERS. WHEN LEARNING A
NEW SYSTEM, EXPERIMENT WITH SMALL, FAMILIAR DATA SETS. REMEMBER, THE MOST
EXPENSIVE PART OF ANY SOFTWARE IS YOUR TIME. THE CARTOON RULE FOR
LEARNING STATISTICAL COMPUTING IS: FAMILIARITY BREEDS RESULTS.


TRYING TO LEARN STATISTICAL THEORY AND
STATISTICAL COMPUTING AT THE SAME TIME IS
A LITTLE LIKE TRYING TO WALK AND CHEW
GUM AT THE SAME TIME. DIFFERENT SKILLS AND
THOUGHT PROCESSES ARE INVOLVED IN EACH.
SET ASIDE SEPARATE TIMES TO LEARN THESE
SUBJECTS, THEN BRING THEM TOGETHER. IN
THIS WAY, YOU CAN BECOME A CHEWING,
WALKING, COMPUTING, RENAISSANCE
STATISTICIAN!




ABOUT  THE  AUTHORS 


CHILDREN, KESTON AND
AMELIA


LARRY GONICK IS THE AUTHOR OR
COAUTHOR OF A NUMBER OF BOOKS
OF NON-FICTION CARTOONING, AS
WELL AS THE FEATURE SCIENCE
CLASSICS, APPEARING BIMONTHLY IN
DISCOVER MAGAZINE. A DROP-OUT
OF HARVARD GRADUATE SCHOOL IN
MATHEMATICS, HE SEEMS TO HAVE
COME FULL CIRCLE. HE LIVES IN SAN
FRANCISCO WITH HIS WIFE, LISA, AND
TWO DAUGHTERS, SOPHIE AND ANNA,
AND HOPES TO UNDERSTAND LIFE
WHILE REMAINING CHAINED TO HIS
DRAWING BOARD.


If you have ever looked for P-values by shopping at P mart, tried to watch
the Bernoulli Trials on ‘People's Court,’ or think that the standard deviation
is a criminal offense in six states, then you need The Cartoon Guide to
Statistics to put you on the road to statistical literacy.

The Cartoon Guide to Statistics covers all the central ideas of modern
statistics:  the  summary  and  display  of  data,  probability  in  gambling  and 
medicine,  random  variables,  Bernoulli  Trials,  the  Central  Limit  Theorem, 
hypothesis  testing,  confidence  interval  estimation,  and  much  more— all 
explained  in  simple,  clear,  and,  yes,  funny  illustrations.  Never  again 
will  you  order  the  Poisson  Distribution  in  a  French  restaurant! 

Larry Gonick is the author or coauthor of four previous cartoon guides by
HarperCollins. Woollcott Smith is a statistician active in research
and consulting, and a professor at Temple University.

‘Gonick is one of a kind.’ - Discover


HarperPerennial

A Division of HarperCollinsPublishers
Cover design and illustration ©1993 by Larry Gonick

USA $15.00

CANADA $21.50

